When I first began collecting SNOBOL4 programs for a book, I
had two major misgivings. First, I wondered whether there 
would be enough material and second, I wondered whether the 
programs would be sufficiently nonobvious to warrant publica- 
tion. Both fears slowly evaporated. On the one hand, the 
range of SNOBOL4 applications is as wide as the spectrum of 
computer uses and this, it seems, is well-nigh inexhaustible. 
Indeed, an entire book of algorithms and algorithmic techni- 
ques has recently appeared [Aho et al, 1974] in which the 
range of applications and techniques when intersected with 
that of my own book approximates the empty set. It gives one 
pause to contemplate the complement of both sets. In the end, 
I had a considerable amount of material left over and so my 
one fear was baseless. 


As to my other concern, I was happy to discover in the course 
of writing the book many new and nonobvious ways of program- 
ming in SNOBOL4 (not all of my own discovery) so that I can 
now be confident that the routines collected here are more than
merely exercises in the use of the language. Indeed, some
routines or techniques were previously believed to be
impossible to write in SNOBOL4. For example, employing SNOBOL4
patterns directly in the compilation process, dynamically 
loading SNOBOL4 functions on a call basis, and determining the 
compilation numbers of statements compiled at execution time 
are three problems encountered during the development of 
production programs which were previously thought simply not 
doable in the language. These are relatively easily achievable 
by techniques described in this book (see Programs L_ONE 
(18.2), DEXTERN (14.2) and LPROG (11.5) respectively). Since 
I have been a SNOBOL programmer for over a decade and since I 
am still discovering how to do things in the language, the
reader may conclude either that I am a dunce or that the 
designers of SNOBOL4 have created a very flexible and powerful 
language that deserves further study and wider use. The 
remainder of the book will convince him, I hope, that it is 
the latter and not the former. 


Another, less prominent, concern was the relative obscurity of 
the SNOBOL4 language. While more widely used and available
than most languages, it is not so ubiquitous as, say, Fortran or
Cobol. For a variety of reasons, such as cheaper machines, it
is not hard to visualize a future in which SNOBOL4, or at
least a SNOBOL4-like approach to life, will play a more
prominent role. Also, the quest for simplicity of programming may
ultimately be achieved by way of semantic richness rather than
by feature elimination.


Viewed most generally, the book is a collection of algorithms 
with SNOBOL4 used as a communication vehicle. The algorithms 
are decidedly oriented toward the nonnumerical as this is
SNOBOL4's forte and as such tend to supplement other published 
algorithms such as those appearing in the Communications of 
the ACM which, due to the reliance on Fortran and Algol, are 
primarily mathematical in nature. Because of its nonnumerical 
character, the book should be especially helpful to artisans 
in the humanities and in business applications as well as to 
the information scientists to whom the work is primarily
addressed. The reader is assumed to know or be learning SNOBOL4,
and if his knowledge in this respect is a little weak he
should be willing to consult an appropriate manual or primer 
for reference. Little or no assumption is made with respect 
to his knowledge of other areas of computer science and 
mathematics. 


As a collection of SNOBOL4 algorithms, the book lends itself 
for direct use by the growing number of SNOBOL4 programmers 
who may use the programs as is, or modify them to suit their 
particular application. To further this end, virtually all 
programs are written as functions with a conscientiously
applied naming system so that they can be simply 'plugged in' to
existing programs without disturbing things. Hence another
purpose is served, i.e., to foster and illustrate a technique
of well-structured modular programming which is all too
frequently lacking in many SNOBOL4 programs. There is currently
great interest, and for good reason, in goto-less structured
programs, and while the control structures of SNOBOL4 prohibit
adherence to the letter of this dictum, the examples in this
book serve to carry out its spirit.


The SNOBOL4 programmer will find much information of an
implementation nature not available elsewhere. Most of this is
intended to guide him in the writing of more efficient
programs but some SNOBOL4 lore is included for his general
information. An effort has been made to describe pattern
matching more fully and comprehensively than it has been
heretofore as this has been one of the murkier aspects of the
language.


Finally, the large number of SNOBOL4 example programs should
complement well a SNOBOL4 primer or manual in teaching the
language. This author's experience has been that programming
languages as well as natural languages are most easily taught
by varied and intriguing examples. Not only is interest
heightened and motivation increased, but the example carries
the student forward on a familiar framework and provides a
convenient gestalt for later recall. Because of this use as a
supplementary text, various features of the language are
compartmentalized in the early chapters so that their
introduction can be synchronized with a course of instruction. In fact
the author has used notes from this book very successfully in 
teaching a course in nonnumerical programming to members of 
the staff at Bell Laboratories and to graduate students at 
Stevens Institute of Technology. A number of exercises have 
been included to extend its usefulness in the classroom as 
well as to suggest possible modifications of the routines 
themselves. 


The alert reader will note that the book was prepared by a 
computer. This was done to permit the automatic testing of 
the programs. To remain faithful to this idea, all figures, 
titling, paragraph illumination, etc. were done without suc- 
cumbing to the temptation of later touchup. Chapter 10 
describes in detail some of the routines used in the book's 
production. 


The programs, as presented, are directly applicable to the IBM
360 implementation of SNOBOL4 and SPITBOL. In virtually all
cases, these programs can be used with SNOBOL4 processors
(including SITBOL) on other machines without change or, at 
most, by a transliteration of characters. 


The writing style has been chosen to be direct, informal and 
sometimes even cheerful. It is hoped that occasional lapses 
into whimsy (not expunged by the final version) do not disturb 
the reader; the intent is not so much to amuse as to present 
a welcome relief to the frankly difficult task of reading and 
interpreting programs. 


A number of individuals have contributed in one way or another 
to the production of this book. Thanks go to Frank Boesch, 
Len Bosack, Fran Brophy, Steve Chen, Bob Dewar, Ralph 
Griswold, Scott Guthrey, Dave Hanson, Cass Lewart, J. C. Noll, 
Ivan Polonsky, Mark Rochkind, Larry Samberg, Dick Stone, and 
Jane Walsh. A special appreciation goes to Ralph Griswold who 
taught a Computer Science course at the University of Arizona 
from an early computerized draft of Chapters 2-5 and provided 
valuable feedback. I am flattered that he was able to expand 
on this material to produce an excellent and very readable 
book [Griswold 1974a]. Those having difficulty reading the 
early chapters here may wish to consult this text. 


Finally, thanks go to the management and staff of Bell 
Laboratories whose consent, cooperation and computers have 
made this text possible. 


James F. Gimpel 
Holmdel, New Jersey 
May 1, 1975


Algorithms and Programs


An algorithm is a sequence of self-evident steps for
carrying out some activity. A familiar example of
an algorithm is the procedure for 'long' multiplication,
which multiplies two numbers which are bigger
than the operands in a memorized table. The notion
of algorithm is actually quite old, going back several thousand
years B.C. [Knuth 1972], and the word 'algorithm' has a long
and convoluted etymology [Knuth Vol. 1, p. 1-2].


We say an algorithm is composed of "self-evident steps" to 
rule out some such phrases as "add salt to taste", or "apply 
sward to mainskee according to Fig. 3". That is, each step 
can be mechanically carried out without assistance from a 
human being. But it is interesting to note that the definition 
of algorithm is not a rigorous one, since no one can ever give 
an all-inclusive definition of "self-evident step". What we
generally do is devise a special language within which each 
operation is carefully defined, and this language is used to 
express all algorithms. Thus we can devise a special machine 
language as was done by Knuth [Vol. 1-3], or we may devise a 
matching and replacement operation as was done by Markov 
[1954], or invent a dialect of some existing language, such as
Pidgin ALGOL [Aho et al, 1974], or we may use an existing
programming language, such as is used in the Algorithms
section of the Communications of the ACM. In this book we will
use an existing language, viz. SNOBOL4 [Griswold et al, 1971]. 


This means that the techniques in our collection are not merely
algorithms; they are programs as well. Since there is some
question (not to mention controversy) as to the distinction
between algorithm and program [ACM Algorithm Letters, 1966 and
ACM Forum, 1974-1975], it is perhaps worth our trouble to
consider these two notions. An algorithm is a method, distinct
from any external form, and distinct from any language. On
the other hand, a program is a sequence of characters which 
will implement some process. For example, we may say that a 
program is 332 characters long, but we may not say such a 
thing about an algorithm, because an algorithm may be im- 
plemented in several different languages producing programs of 
various lengths. To communicate the algorithm to another human 
being, we generally require its formulation in terms of 
concrete symbols. Any such formulation may be said to be a 
program. Hence, on the surface at least, the notions of al- 
gorithm and program would seem to bear the same relationship 
to each other as the notions of function and expression in 
mathematics. That is, one is a representation of the other. 
However, the analogy is somewhat imperfect. Programs are 
generally written to be run on a digital computer, and, as 
such, tend to communicate an algorithm to a machine, as op- 
posed to another human being. Programs are a medium whereby a 
process is effected, and hence are, as it were, part of the 
machinery. We may therefore expect them to reflect 
idiosyncrasies not part of the original pure algorithmic
notion. That is, programs may be dirty. On the other hand,
programs, when coupled with an appropriate linguistic
processor, can actually carry out the activity for which they are
designed. In short, they work. 


Although in principle an algorithm is independent of the
particular language in which it is expressed, in practice, this
is an impossibility. This is because, as the notion of self- 
evident step varies, the techniques employed to carry out an 
overall activity will vary. Thus, a method to compute a hash 
function will depend on what arithmetic operations (such as 
division) are available. Random number generators will depend 
not only on what operations are present, but on whether some 
forms of arithmetic overflow are permitted. Certainly, string 
algorithms implemented in a Markov language such as SNOBOL4,
which permits string scanning as a fundamental operation, will
appear entirely different than when written in some other 
language. This is unavoidable and is, of course, one of the 
purposes of a text like this one. 


There is currently heightened interest in both algorithms and 
in programs. For example, there is a famous problem in graph 
theory called the Koenigsberg Bridge Problem. The problem 
calls for a path leading across all edges (bridges) of a graph 
without traveling along any edge twice. A constructive 
procedure for finding such a path was furnished by Euler in 
1736; this has long been regarded as the starting point of 
modern graph theory. However, it was not until 1973 [Edmonds
and Johnson] that anyone specified a method for finding such a
path in an amount of time proportional to the number of edges. 
This particular example is only typical of a general trend. 
We are no longer content with knowing that a procedure can be 
carried out, nor even with how such a procedure can be carried 
out. The thrust of much computer science activity is in
determining how effective a particular algorithm is, and in
carefully specifying an algorithm to maximize efficiency.


Another area of waxing interest is in determining the proper 
form of a program. Virtually unheard of five years ago, the 
term 'structured programming' has captured the fancy of the
computing fraternity and, at this writing, is perhaps the most 
used (and abused) term in the literature's lexicon. While the 
term means many things to many people, the general idea is 
that many of the ills plaguing the software industry are 
traceable to the fact that we are incapable of properly struc- 
turing large complex tasks. While we can study the strategy 
of structuring from a language-independent point of view, many 
of the tactics in forming clear and cogent code depend on the 
particular tools at one's disposal. Hence, another purpose of 
this text is to discuss and present methods of organizing, 
i.e., structuring, SNOBOL4 programs.


SNOBOL4 Origins


Programs written in SNOBOL4 tend to
be oriented toward the manipulation
of strings. A string is a sequence of characters
and a character is any of the various letters,
digits, logograms and punctuation symbols (including
the blank) that one might punch on cards or type on
an electronic terminal. The stream of characters you are
reading now is an example of a string. It has, in fact, been 
subjected to some of the algorithms to be described in this 
book. 


String processing includes the testing, comparing, scanning, 
rearranging, transliterating, transforming, inserting, 
crunching, and deletion of strings. Since programs and data 
are normally entered into a digital computer in the form of 
strings and since all data printed is in this form, it might 
seem that string processing is, and always has been, in the 
forefront of computer studies. But this is hardly the case.
Historically, string processing has been something of a
stepchild of computation.


The computer was initially perceived as a machine whose 
primary purpose was performing numerical computations. Getting 
numbers and programs into the machine was considered
incidental to computing rather than occupying any central role. In
fact, to program an early machine, one did not use characters
at all, but wired up a plug board. A single program took weeks
of effort. Humans began to realize that they were more like
slaves to the machine than high-priests as they were forced to
do an inordinate amount of work just to keep the machine busy. 
Alt [1972] recalls that, as early as 1947, the team of 
programmers for the ENIAC discovered a method whereby they 
could enter programs by merely dialing digits rather than 
wiring plug boards. To do this they wired the plug-board
control permanently in such a way that the machine read the
digits and performed associated instructions in much the same
way that a modern interpreter might do. This seems to be the
world's first higher level language. At any rate, the machine
slowed by a factor of five but the technique was the preferred
one thereafter. Why? Was it because men are lazy and they
want the machine to do all the work? Well, there is a way to
express this less argumentatively. The machine was so
successful at performing arithmetic that the bottle-neck shifted away
from calculations with numbers to the logistics of presenting 
the problems to the machine. In many ways this problem is 
still with us. 


Peripheral devices for reading characters from paper tape and 
cards had existed for some time and it did not take long 
before such devices were attached to the machine for 
input/output. More importantly, machines were beginning to be 
designed with the stored-program concept which meant that plug 
boards did not have to be wired for each different program. 
Rather, like the trick used with the ENIAC, the machine would
translate numbers into instructions, but with the important
difference that the numbers did not have to be set manually.
They could be read from some external device or they could be
computed; in particular, they could be produced by some other
program, and the Great Age of computer languages was born. From
this point on, the evolution of machine design gave way to an
evolution of languages, in much the same way that human
biological evolution has given way to a cultural evolution.
Although the components have changed to give us cheaper,
smaller, more efficient machines, the machine organization has
remained essentially the same (the Von Neumann Machine). In
this organization main storage consists of an aggregate of
words each addressable by some assigned number. The data
within this storage is entirely unstructured as seen by the
hardware. Complex data such as strings, patterns, arrays, etc. 
are only such in the eyes of the software, not as viewed by 
the hardware. 


The first programming languages were, of course, assembly 
languages in which generally there is a one-to-one correspon- 
dence between lines in the source language and machine 
instructions. The assembler's job is essentially to translate 
from names (suitable to humans) to numbers (suitable to 
machine). This is unnatural for a machine to do and it was 
resolved essentially by a mechanism known as a symbol table 
(see Chapter 11). The use and disposition of a symbol table 
is key to the implementation and understanding of many 
programming languages in addition to assemblers. 


A rather impressive advance was made by the Fortran language 
which was developed in the mid-1950's. This language was so
well designed that today it is perhaps the most widely used 
programming language in spite of regular denunciations by the 
academic community. Fortran opened up computation to a large 
number of programmers who would need to know nothing or very
little of the internal organization of the machine in order to 
start programming (although they usually wind up having to 
know a great deal). Now an important point to note in connec- 
tion with Fortran is its peculiarly numerical orientation. The 
tools provided to the Fortran programmer were totally dif- 
ferent than the tools required by the system programmers who 
had to write assemblers, operating systems and the Fortran 
compiler itself. Fortran had, for example, a rich mathematical 
library containing trigonometric functions, exponentiation, 
etc. which the writers of Fortran had absolutely no need for; 
on the other hand, Fortran lacked string, character, bit and 
address data objects which are essential to 'systems' work.
Although a step away from the numerical was made in that the
language gave the machines the ability to accept programs in
human style, it was assumed that the end use would be 'number
crunching'.


The first non-numerical language of consequence was IPL 
[Newell 1957]. This language was developed as a by-product of 
some experiments in artificial intelligence by Newell, Shaw 
and Simon in which an attempt was made to mimic the thinking 
patterns of human beings. In particular, the mental processes
involved in theorem-proving were explored [Feigenbaum and
Feldman 1963]. IPL is a list-processing language. All data 
is in the form of lists; the components of a list may be other 
lists or basic non-decomposable units which are actually ad- 
dresses referenced symbolically as in an assembler. Numerous 
built-in functions are available to manipulate lists. In fact, 
an IPL program is itself a list. The arch-difficulty of IPL 
is its syntax which is forbiddingly like assembly language.


IPL was soon followed by LISP [McCarthy 1960] which overcame 
some of the syntactic difficulties of IPL. Rather than place 
components of a list vertically down the page with symbolic 
reference to sublists, LISP provided a more abbreviated 
horizontal notation with nested parenthetical expressions to 
denote sublists. Moreover, the basic nondecomposable unit,
called the atom in LISP, was a string. In LISP, large strings 
were represented as lists of atoms, and atoms, as their name 
suggests, could not be decomposed. 


A list was the first data object whose size was not fixed for 
the duration of the program but which could vary as required. 
Lists are particularly useful in problem areas which are not 
well understood and cannot, or at least, have not been reduced 
to easily computable mathematical formulas. Hence list struc- 
tures have been a favorite form of data for artificial intel- 
ligence applications. 


COMIT is often considered the first true string processing 
language. Unlike LISP, the strings of COMIT can be arbitrarily 
manipulated not by rearranging pointers between fixed strings 
but by completely rearranging the characters (and hang the
cost). With COMIT the string had become a data object; a
variable (of sorts) could range over the entire set of
strings. These variables were called 'shelves' and were
referenced by shelf number. A very powerful process called
pattern matching could be applied to such strings and matched
substrings could be replaced by other strings. COMIT has one
major deficiency: one may not use ordinary common names such
as S, LIST, or BILL to denote variables as one might do with 
numerical variables in Fortran or even assembly language. 


The pattern matching notation entered COMIT by way of 
linguistics where the notation is quite old. The notation was 
studied in depth by Markov [1954] who treated the replacement 
operation as a fundamental algorithmic component and showed
that all computations were possible using replacement alone. 
Languages such as COMIT and SNOBOL4 are sometimes referred to 
as Markov languages even though there is no evident historical 
connection. 
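

As a small illustration of computing by replacement alone, the
following sketch sorts a string of A's and B's by applying a single
rule, 'BA' becomes 'AB', until the rule no longer applies. The
subject string and the label LOOP are chosen here purely for
illustration; the example is not taken from Markov's work.

```snobol4
* A one-rule Markov-style algorithm (illustrative sketch).
* The rule 'BA' -> 'AB' is applied to the leftmost match
* until the pattern match fails.
        S = 'BABBA'
LOOP    S 'BA' = 'AB'                     :S(LOOP)
        OUTPUT = S
END
```

Each pass moves one A to the left of one B, so the loop must
terminate; for the subject shown the final value of S is AABBB.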


Early work at Bell Laboratories in string processing included 
the development of a language called SCL (Symbolic
Communication Language) by Lee, et al [1962]. SCL extended the
facilities of COMIT for string processing but had several
deficiencies including an ungainly assembly-language syntax
and the absence of variable names (as in COMIT). SCL had
certain unique and valuable features such as a run-time
compilation and execution of strings, but its most valuable
contribution was that it provided a gestation period for SNOBOL.


SNOBOL [Farber et al, 1964] combined two very important ideas, 
the string processing and pattern matching of COMIT and the 
symbolic referencing of variables. Thus for the first time in 
any major language (and possibly ever), a programmer could 
write: 


A = B C


to indicate in a simple and natural way that the string B 
concatenated with the string C is to be assigned to the string 
A without disturbing the values of either B or C. The pattern 
matching operation of COMIT could be invoked in a similarly
convenient and concise fashion. Thus for the first time, 
strings of characters could be manipulated with the notational 
ease that Fortran provided for numbers. 
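

The concatenation statement can be tried directly. The following
is a minimal sketch in SNOBOL4 notation; the string values 'SNO'
and 'BOL' are arbitrary choices made for this illustration.

```snobol4
* Concatenation is written by juxtaposition: the blank between
* B and C is the operator.
        B = 'SNO'
        C = 'BOL'
        A = B C
        OUTPUT = A
* B and C still hold their original values.
        OUTPUT = B
        OUTPUT = C
END
```

The three lines printed are SNOBOL, SNO and BOL, showing that
the operands are left undisturbed.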


Unlike Fortran, however, no simple easy translation existed 
into machine orders. On the IBM 7090, on which SNOBOL was 
first implemented, concatenation was a complex process
requiring the shifting of characters through an ungainly
accumulator. Also, the use of variables whose values cannot
be destroyed further complicates the operation of
concatenation. Thus, we cannot merely direct a pointer from B to C to
effect the above concatenation as this would alter B. We
cannot copy C onto the tail end of B as this would destroy
other data. Rather, a separate section of core is allocated, 
the strings B and C are copied in, and a pointer is directed 
from A to the new storage. Since storage is being generated 
continuously, a process of storage recovery (garbage collec- 
tion) is required. Thus, the apparent simplicity requires a 
rather considerable software system to support it. It is not 
surprising that it appeared relatively late on the programming 
scene. 


SNOBOL's successors, SNOBOL3 [Farber et al 1966] and SNOBOL4 
[Griswold et al 1968], while retaining the simple and powerful
notation of the original SNOBOL, greatly extended and 
generalized its facilities. In fact, it is no longer accurate 
to characterize SNOBOL4 as a string language, since its 
facilities extend considerably beyond string manipulation. 


The Future


How well may we expect SNOBOL4 to fare in
the future? Certainly, this is an
intriguing question to ask of any language and one
which is extremely difficult to answer. To a first
approximation, the success of the language will
depend on the future importance of nonnumeric data
processing. Although numerical programming will doubtless
increase in the future, non-numerical processing should
increase even faster. This is due to the economics of the
situation. A computer can multiply two 8-digit numbers
together in approximately 6 microseconds whereas it takes a
human about 60 seconds. The computer is therefore 10^7 times
(or 7 orders of magnitude) faster at this activity than
humans. On the other hand, to take a typical string-processing
problem, a computer, carefully programmed, will require about
two milliseconds to scan a paragraph containing 1000 characters
for some string such as 'ALPHA', whereas a human will require
approximately 20 seconds. Hence, the machine for the
non-numeric problem is only 10^4 (or 4 orders of magnitude) faster
than the human. Hence, the machine is better at numerical
processing by about 3 orders of magnitude. Since historically
computers have been much more expensive than humans it is un- 
derstandable that they have been applied mostly in those areas 
with a strong arithmetic flavor. 
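

For reference, the speedup figures in this paragraph follow from
simple ratios. The block below restates them using only the
timings quoted in the text; nothing new is assumed.

```latex
\[
\frac{60\ \text{s}}{6\ \mu\text{s}} = \frac{60}{6\times 10^{-6}} = 10^{7},
\qquad
\frac{20\ \text{s}}{2\ \text{ms}} = \frac{20}{2\times 10^{-3}} = 10^{4},
\qquad
\frac{10^{7}}{10^{4}} = 10^{3}.
\]
```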


Another factor to consider in comparing the two kinds of 
processing is input/output (i/o). Two numbers that are mul- 
tiplied together typically do not come from typed data but are 
the result of other computations within the machine. But the 
string that is being scanned for the word 'ALPHA' has
generally entered the machine from some i/o device such as 
disk, tape or terminal. If we consider disk as typical we find 
that this device transmits 10,000 characters in a total time 
of about 100 milliseconds so that our paragraph to be scanned 
requires 10 milliseconds. Multi-programming operating systems 
help somewhat to alleviate the problems of delay time due to 
disk i/o by transferring control to another resident program 
while i/o is in progress but the program doing i/o must remain 
resident in main storage thereby consuming resources. If we 
add a factor for the inefficiency of the transfer of control 
process and the time expended in transporting the characters 
from the main storage receiving stations (i/o buffers) into 
work areas we arrive at a figure very much like ten mil- 
liseconds anyway. The net effect is that if the string to be 
scanned is also read and written we increase the cost of 
string processing by another order of magnitude. 


Another difficulty with string processing that has helped hin- 
der its more rapid development is that string operations are 
by no means standardized at the machine level. Thus, string 
processing is not only slower, it is more complicated. In 
Fortran, the statement: 


X = Y*Z


results in three instructions, LOAD Y, MULTIPLY by Z, and
STORE into X. No such corresponding instruction sequence can 
be produced for typical SNOBOL4 operations such as pattern 
matching or concatenation. Not only do these operations re- 
quire more instructions but the methods vary from machine to 
machine. To begin with, the method of representing strings 
varies [Madnick 1967]. Representational decisions such as 
whether to store one character per word or several characters 
per word may depend on machine characteristics such as whether 
characters are directly addressable. Another important
difference is how string values are bound (assigned) to
variables. For example, in PL/I the only very efficient string
representation is to allocate a given storage area of maximum 
size for each string variable. On the other hand, an implemen- 
tation of the SNOBOL4 language requires that a pointer be 
associated with each variable which points to the actual 
characters. This may seem like a minor difference but it is 
not; in the PL/I approach a simple string assignment such as: 


S1 = S2 


results in copying the string. In SNOBOL4, only the address 
is copied. However, the latter method implies the necessity 
to garbage collect whereas the former does not. That is, if 
S1's pointer is overwritten by another pointer, the old string
pointed to by S1 may no longer be needed. Experience shows 
that we cannot afford the luxury of retaining every string 
ever referenced in a string-processing application, and so, 
obsolete strings must be discarded. 


Even fixing on a common data representation, the method of
scanning a string S for a substring, say 'ALPHA', can vary
considerably. The IBM 360/370 contains a TRT* instruction
which enables the machine to quickly scan a string for one of
a set of characters. Thus, we might rapidly scan the string S
for the lead character 'A' thus increasing the scan rate. But
time is required to set up this rapid scanning. For short
strings or for strings containing many A's it would be more
economical not to use this special scan. Even given the rapid
scan ability, it is not clear that 'A' should be the character
searched for. If we assume that P's occur less frequently than
A's then a rapid scan for the letter 'P' should be made. Given
any such 'P' we can then check for the characters 'AL'
directly before and 'HA' directly after.
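

The rare-letter strategy can be mimicked at source level. The
sketch below is only an illustration of the logic: the real gain
described above comes from a machine instruction such as TRT, and
a SNOBOL4 programmer would ordinarily just match 'ALPHA' directly.
The names TEXT, I and K and the sample subject are assumptions
made for this example.

```snobol4
* Find each 'P', then verify 'AL' just before it and 'HA' just
* after it.  Offsets are counted from zero.
        TEXT = 'A PAPAL ALPHABET'
        I = 0
* Advance to the next 'P' at or after offset I; @K records its offset.
SCAN    TEXT POS(I) BREAK('P') @K             :F(NOTFND)
        I = K + 1
* A 'P' in the first two positions cannot be preceded by 'AL'.
        LT(K,2)                               :S(SCAN)
        TEXT POS(K - 2) 'AL' LEN(1) 'HA'      :F(SCAN)
        OUTPUT = 'ALPHA BEGINS AT OFFSET ' (K - 2)   :(END)
NOTFND  OUTPUT = 'ALPHA NOT FOUND'
END
```

For the subject shown, the first two P's are rejected by the
check on neighboring characters and the third is accepted, so the
sketch should report offset 8.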


The setup tradeoff is not unique to the 360/370 architecture. 
For many machines a fast inner loop can be written to test for 
a specific character that will be faster than a loop to test 
for an arbitrary character (which is, say, in a register). If
one is willing to invest time in forming characterizations of 
the subject string (the string being scanned) one can perform 
a kind of hash test [Harrison 1971] which is very fast. This 
is inefficient, however, unless the subject string will be 
scanned repeatedly. 


The complexity involved in specifying string algorithms 
becomes significant in several ways.. The languages for string 
processing must call functions rather than compile in-line 
code and the linkage overhead further slows down computation. 
In fact, most implementations tend to be interpretive which 
greatly reduces the speed of numerical operations if, for 
simplicity, these are also treated interpretively. Complex 
language processors cannot be built as rapidly and any string 
language will experience more difficulty in being reproduced 
on some other machine. When a processor, such as the macro 
implementation of SNOBOL4, attempts to be machine-independent, 
it must sacrifice efficiency significantly. For example, the 
macro implementation of SNOBOL4 will scan a string for a 
substring at the rate of 40 microseconds per character (on the 
IBM 360/Mod 65), a full order of magnitude slower than is 
possible on that machine, essentially because of its machine 
independence. The most efficient utilization of any machine 
for typical string operations requires in general a complete 
restructuring of the program and this tends to inhibit the 
rapid spread of any language. 


*TRT stands for TRanslate and Test. This is a misnomer; 'Scan 
and Test' would have been better. 


The complexity issue becomes important when one realizes that 
the very great strides in producing economical computation in 
the last several years have come in the form of minicomputers 
and microcomputers. These machines tend to be small, new and, 
as is characteristic of a new industry, exhibit a relatively 
large number of different designs. All three factors tend to 
work against a large ambitious SNOBOL-like language. 


As the early ENIAC programmers discovered, however, very few 
problems are so purely numerical that the machine can be 
casually fed problems and spew out answers. In fact, most of 
what mankind wants done is non-numerical and is difficult if 
not impossible to program. By contrast, those problems which 
are very numerical have probably already been programmed or 
are embedded so intricately in an essentially non-numerical 
setting that the numerical part can't be brought easily to the 
machine. To consider just one example, the filling out of 
one's income tax can be done conversationally from a computer 
terminal; the amount of computation that must be performed is 
insignificant compared to the total programming required to 
make the system usable by the 'unwashed' (naive) user. Hence, 
if we are to extend the application of computers to new areas 
there will surely be much about these areas that is 
non-numerical. 


SNOBOL4 Implementations 


SNOBOL4 was developed during a period of computer changeover 
at Bell Laboratories and so the language was written in a 
system of macros [Griswold 1972]. In this way, the language 
could relatively easily be transported to the new machine 
(whatever it was going to be). This had the fortunate 
consequence of making SNOBOL4 transferable to other, different 
machines with far less difficulty and with much greater 
faithfulness to the original design than would otherwise have 
been possible. This implementation is usually referred to as 
the MAcro Implementation of SNOBOL4; we will refer to it 
throughout as MAINBOL. 


While MAINBOL is relatively portable, it is also inefficient. 
This is due primarily to its machine independence. A fair 
estimate of the cost of machine independence in the case of 
SNOBOL4 is a factor of two in both space and time. 


SPITBOL [Dewar 1971] was developed to overcome the 
inefficiencies of SNOBOL4, at least for the IBM 360. By writing 
exclusively in assembly language, by developing new techniques 
for string handling and storage management, and by compiling 
executable code rather than running interpretively, SPITBOL 
was able to better the running speed of MAINBOL by a factor of 
7 (this was a median figure of 21 programs tested at Bell 
Laboratories). SPITBOL is also smaller than MAINBOL by a fac- 
tor of two. It should also be pointed out that SPITBOL not 
only did not compromise with the language which so often hap- 
pens when a language is reimplemented from scratch, but 
actually extended the language in several significant ways. 


The SITBOL processor [Gimpel 1973a & 1974] is a completely new 
implementation of the SNOBOL4 language for the PDP-10. SITBOL 
benefitted greatly from the SPITBOL experience, using and im- 
proving upon the implementation innovations of SPITBOL. 
Although SITBOL is an interpreter, it is faster than MAINBOL 
by a factor of from 3 to 5 and is smaller by a factor of 3. 
SITBOL is upward compatible with both SNOBOL4 and SPITBOL and 
contains many language enhancements as well. These three im- 
plementations are discussed more fully in Chapter 11. 


While these are the only implementations that can claim to 
support a full SNOBOL4, the FASBOL implementation [Santos 
1971] should also be mentioned. This ambitious project is in- 
tended to produce a compiler for SNOBOL4 that, in addition to 
obtaining high speed, supports separate subroutine compila- 
tion, compiled patterns and in-line arithmetic. FASBOL, 
however, lacks several SNOBOL4 features and many of the 
programs in this book will therefore not run under that 
system. 


SNOBOL4 foibles 


Winston Churchill's famous statement about democracy can be 
made with particular aptness about SNOBOL4. It is the worst 
of all programming languages, except for all the rest. By this 
we mean that SNOBOL4 is a very effective programming language 
not because it is free of blemish (it actually has quite a 
few) but because of the many valuable features which it does 
have. In my own experience, unless the problem is totally 
numerical, a SNOBOL4 program will be at most half as large as 
one written in some other language to achieve the same effect. 
In some cases the reduction in size and complexity is indeed 
dramatic. SNOBOL4 achieves this code condensation by providing 
a number of facilities simply not available in most other 
languages. These include pattern matching which is so rich as 
to amount to a language within a language. The storage 
allocation facility, while conceptually simple, completely 
frees the user from concern over the detailed disposition of 
data objects. All data objects are represented by a descriptor 
of fixed size. This makes it possible to have heterogeneous arrays, 
declaration-free variables and structures, and, most impor- 
tantly, it allows data objects to be freely transferred bet- 
ween calling and called functions. The historic tendency of 
interpreters to include symbol tables during execution leads 
to a number of facilities not normally available. These 
include indirect referencing, indirect goto's, dynamic defini- 
tion of functions and structures and, the ultimate source of 
freedom and flexibility, the ability to compile and execute 
arbitrary strings. It has a comprehensive tracing and error 
recovery facility and the ability, through numerous keywords, 
to provide the user with all sorts of information concerning 
his running program. 


In general, the power and flexibility of SNOBOL4 are une- 
qualed. While the language can be abused, as many languages 
can be, it has many features which, properly employed, enable 
large programs to be written with a minimum of difficulty. 


This is not to suggest that the language is entirely free of 
defect. As in any ambitious project of SNOBOL4's magnitude, 
there are many minor deficiencies. Moreover, merely knowing 
about them does the language designer no good. Liabilities 
get 'frozen' into a language since it is impolitic to make 
non-compatible changes. For casual SNOBOL4 programming we may 
ignore many of these deficiencies. When composing large 
programs, however, it is much more important to develop a 
systematic approach and we must confront these defects 
squarely. 


As remarked by Dunn [1973], a language which is very inef- 
ficient can be a burden to use even though the application, 
such as bootstrapping, is not nominally one demanding high ef- 
ficiency. Dunn was critical of SNOBOL4 in this regard but his 
remarks were actually directed to a specific implementation, 
MAINBOL. As Hanson [1973] remarks, the inefficiencies noted 
in using MAINBOL do not apply to SPITBOL and SITBOL. Our 
remarks in this critique will be directed only to the SNOBOL4 
language as described by Griswold et al [1971] and not to any 
particular implementation. 


1. Perhaps the most noted deficiency of SNOBOL4, especially 
in an age when the goto is harangued daily, is the lack of 
good control structures. They are admittedly primitive 
[Griswold 1974]. There is no IF ... THEN ... ELSE, and no 
repetition element such as the Fortran DO. One is forced to 
use many goto's and to invent unique label names. This is a 
bother and conventions must be adopted. It is not, however, 
as detrimental to good programming practice as one might 
think, since it generates dependency on the use of the func- 
tion which is a superior control structure anyway. See the 
remarks on Structured Programming. 


2. A number of difficulties involve pattern matching. Pattern 
matching is a complex process and to be used fully requires a 
comprehensive understanding on the part of the user. For this 
reason two chapters in this book are devoted to a theoretical 
and practical treatment of the subject. But aside from the 
learning problem there are residual difficulties. One of these 
is the one-character assumption which we discuss more fully in 
Chapter 7. The statement below: 


HERE     S  LEN(1) $ C  LEN(1) $ D  *LGT(C,D)  =  D C     :S(HERE) 


should sort the string S as it repeatedly swaps any consecu- 
tive pair of characters not in the correct lexicographic 
order. Unfortunately, if the last two characters are out of 
order they are never swapped because the pattern matching 
mechanism assumes that *LGT(C,D) matches at least one character 
and that therefore the entire pattern requires at least 
three characters and that it would be a waste of time to try 
the pattern on merely two characters. The manual will say to 
use FULLSCAN mode to circumvent this but, as we will argue 
later, mode switching is not good practice for large programs. 


Predicates may be employed within patterns in spite of the 
one-character assumption if one employs a trick. See Prog. 
8.7. 
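
The effect of the one-character assumption on the sorting statement can be imitated with a small model. The sketch below is Python, not SNOBOL4, and it is only a simplified picture of the scanner: the parameter assumed_min_len (a hypothetical name) stands for the number of characters the scanner believes the pattern needs, so that 2 is the true requirement and 3 mimics the assumption made about *LGT(C,D).

```python
def swap_sort(s, assumed_min_len):
    """Repeatedly swap the first out-of-order adjacent pair of s.

    assumed_min_len models how many characters the scanner believes the
    pattern requires (an assumption of this sketch, not real scanner code).
    """
    chars = list(s)
    swapped = True
    while swapped:                        # corresponds to :S(HERE)
        swapped = False
        # Only cursor positions leaving enough room are ever tried.
        for i in range(len(chars) - assumed_min_len + 1):
            c, d = chars[i], chars[i + 1]
            if c > d:                     # LGT(C,D): C lexically follows D
                chars[i], chars[i + 1] = d, c
                swapped = True
                break                     # the statement succeeds; start over
    return "".join(chars)
```

With assumed_min_len set to 2 the string is fully sorted; with 3 the final pair is never examined, so a string such as 'ABDC' comes back unchanged.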


3. Another heuristic that gives problems is the length- 
failure, or futility heuristic. Under this assumption, the 
very natural back-referencing operation becomes virtually 
unusable. For example, the pattern matching statement: 


         S  LEN(3) $ X  ARB  *X 


examines the string S for a pair of identical three-character 
substrings, if it would only work. The first three characters 
of S are assigned to X and this string is searched for in the 
remainder of S. Upon failing, the next three characters of S 
should be assigned to X and the search should continue. This 
will not happen, however. When *X does not match by reason 
that there are insufficient characters remaining in S, it 
signals 'length failure' or 'futility' (see Chapter 7 for a 
more detailed discussion of these terms). The scanner believes 
that it can immediately halt all processing and so it does. 
The result is that, unless the first of the pair of three- 
character strings begins with the first character, the pattern 
fails. The error can be cured by FULLSCAN. As indicated in 
the preceding paragraph, however, this introduces other 
problems. 
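
What the back-referencing statement is meant to do, and what the futility heuristic reduces it to, can be shown with another small model. This is a Python sketch under simplifying assumptions: the function name and the futility flag are hypothetical, and the flag merely imitates the scanner giving up after the first starting position fails.

```python
def first_repeated_substring(s, width=3, futility=False):
    """Find a pair of identical, non-overlapping substrings of given width.

    Returns (first_offset, second_offset) or None. With futility=True the
    search is abandoned after the first starting position, imitating the
    length-failure heuristic described above (a simplified model only).
    """
    for i in range(len(s) - 2 * width + 1):
        x = s[i:i + width]              # LEN(3) $ X
        j = s.find(x, i + width)        # ARB *X
        if j != -1:
            return i, j
        if futility:
            break                       # scanner halts all processing
    return None
```

For the subject 'XABCYYABC' the intended search finds 'ABC' at offsets 1 and 6, while the model with futility=True reports failure because the repeated substring does not begin with the first character.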


4. Pattern building, as distinct from matching, also causes 
some problems. The pattern matching statement: 


         S  LEN(N) . K  = 


removes the first N characters from the string S and assigns 
them to the variable K. Unfortunately, the pattern must be 
constructed each time the statement is executed. The cost of 
building the pattern with the concomitant garbage collection 
will require more time than the pattern match itself. A 
solution is 


         P  =  LEN(*N) . K 
            .  .  . 
         S  P  = 


Although this can serve to remove the pattern-building 
operation from the 'inner loop', it creates several other 
problems. One has to think up a unique name (P just won't do 
in a large program). The statement bearing the pattern 
definition is 
separated from the statement bearing the match. This can cause 
difficulties when trying to decipher a large program. The 
side-effect of setting the variable K without any apparent in- 
dication at the pattern match is poor practice. Finally, the 
use of *N is awkward. The novice tends to overuse the deferred 
expression and begins to use it where it produces errors. In 
short, the language becomes more confusing, difficult to learn 
and error prone. 


5. It should be possible in any language to write a function 
whose behavior will be invariant with respect to its environ- 
ment. The language that comes closest to this ideal is Fortran 
with its separately compiled subprogram. SNOBOL4 tends to be 
worse than others in this respect. For example, the function 
ROT(S), below, will return its string argument rotated one 
character to the right. 


         DEFINE('ROT(S)T')                    :(ROT_END) 
ROT      S  RPOS(1) LEN(1) . T  = 
         ROT  =  T S                          :(RETURN) 
ROT_END 


This function will behave properly provided (1) LEN, RPOS, 
binary '.' and concatenation have not been redefined, (2) 
RETURN has not been redefined, (3) the &ANCHOR mode has not 
been set, (4) ROT is not used as a label outside the program, 
and (5) neither ROT, S nor T have been I/O associated. 


6. SNOBOL4 contains no block structure so that problems of 
scope emerge. For example, the function INC(NAME), defined 
below, will increment the named variable. Also, COUNT will 
record the number of times the function was called. 


         DEFINE('INC(NAME)')                  :(INC_END) 
INC      COUNT  =  COUNT + 1 
         $NAME  =  $NAME + 1                  :(RETURN) 
INC_END 


If COUNT is used outside the function, its current value can 
be destroyed. That is, there is no way to isolate this use of 
COUNT from any other that might exist in a program. One may 
designate that COUNT is local (a misnomer, 'temporary' would 
be better) to the function. But this would mean that the value 
of COUNT would be saved before entering the function and 
restored on return and hence could not be used to count the 
number of calls. 


The named variable being incremented by INC may not be 
arbitrary. If it were COUNT, then it would be incremented twice. 
If it were INC, then it would be incremented once, but on 
return its old value would be restored. If it were NAME, there 
would be an attempt to add 1 to the string 'NAME' resulting in 
a fatal error. 
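
The three collisions just described can be reproduced in a toy model of a single global namespace. The sketch is Python and makes several assumptions for illustration: the dictionary VARS stands for the SNOBOL4 symbol table, the helper value treats the null string as zero in arithmetic, and the save and restore of INC and NAME imitates what happens on function entry and return.

```python
VARS = {}   # hypothetical stand-in for the single global symbol table

def value(name):
    """Fetch a variable; the null string behaves as 0 in arithmetic."""
    v = VARS.get(name, "")
    return 0 if v == "" else v

def inc(name):
    """Model of INC(NAME): count the call, then increment the named variable."""
    # Function entry: save, then reset, the function name and formal argument.
    saved = {k: VARS.get(k, "") for k in ("INC", "NAME")}
    VARS["INC"] = ""
    VARS["NAME"] = name
    try:
        VARS["COUNT"] = value("COUNT") + 1
        target = VARS["NAME"]
        VARS[target] = value(target) + 1     # $NAME = $NAME + 1
    finally:
        VARS.update(saved)                   # old values restored on return
```

In this model inc('COUNT') raises COUNT by two, inc('INC') leaves INC at its old value after the return, and inc('NAME') fails with a TypeError because it tries to add 1 to the string 'NAME'.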


7. Function definition is unusually flexible in SNOBOL4, but, 
as has been noted by Abrahams [1974], it also leads to dif- 
ficulties. Since function definition is dynamic, the DEFINE 
must be executed; but where should it be placed? If the DEFINE 
is placed in some initialization section separated from the 
body of the function by some distance, programs become dif- 
ficult to follow. To place the DEFINE adjacent to the body of 
the function, which is good practice, it is necessary to use a 
hop-around construct as we have done above with ROT(S) and 
INC(NAME). But this is troublesome and wastes space. Execution 
time and space are required for: (1) the string bearing 
the function prototype, (2) the code required for the DEFINE, 
the hop-around and the target of the hop, and (3) the string 
bearing the hop-around label. The third item above is ex- 
plained more fully below. 


8. By means of the indirect goto it is possible to do a multi- 
way branch. For example: 


                                              :($TRIM(INPUT)) 


will read a label and go to it. But this requires that every 
label must be in the symbol table at run-time. Not only must 
the physical characters of each label be present, but so must 
additional storage to house other data associated with a 
name. This additional information averages about 32 characters 
across several implementations. A 40-character storage penalty 
for each label is considerable for large programs. 


9. In SNOBOL4, INPUT/OUTPUT is markedly clean and uncluttered; 
but it generally lacks facilities. If one is only transmitting 
strings to sequential files, SNOBOL4 is adequate. However, no 
special facilities exist for printing columns of numbers or 
for doing direct-access I/O. Output media intended for human 
viewing is really two dimensional and merely outputting 
strings is inadequate. Although an extension to the language 
was made in this regard [Gimpel 1972a] space limitations have 
excluded it from most implementations. 


10. Writing a real literal with a leading decimal point, such 
as '.1', results in a strange error. One must write '0.1', not 
'.1', because unary '.' is an operator, which should be applied 
to a variable, not a value such as 1. 


11. There are several precedence anomalies. In virtually all 
programming languages, the operators '/' and '*' have the same 
precedence and associate to the left. In SNOBOL4, '*' has a 
higher precedence than '/'. 


The precedence of concatenation is one of the lowest whereas 
it should be one of the highest. Thus, 


         A B + C 


is parsed as A (B + C). 


The two highest precedence binary operators, viz. '¬' and '?', 
associate differently. The first associates to the right and 
the second associates to the left. What is one then to make 
of: 


         A ¬ B ? C 


12. SNOBOL4 usurps the characters '<' and '>' for bracketing 
which renders them unusable as operators. This means one must 
use the relatively primitive: GT(X,Y), GE(X,Y), etc. But 
square brackets are available, at least in ASCII, for the pur- 
pose and these are unused. 


13. The use of a blank to denote concatenation seems to force 
the language to require surrounding binary operators with 
blanks. Thus, it is a mistake in SNOBOL4 to write 'A+B'; one 
must write 'A + B'. This causes learning problems. 


The blank operator also requires placing a function call adja- 
cent to its arguments. A common mistake for beginners, for 
example, is to write: 


TRIM (INPUT) 


and wonder why the TRIM function didn't work. No error can be 
signalled for this sequence, of course, which dutifully 
prepends the input with the current value of the variable TRIM 
which is probably null. 


14. To compound the learning difficulties, the blank binary 
operator is also used to denote pattern matching. If one is 
teaching SNOBOL4 one must explain why the sixth blank below 
denotes pattern matching while the others denote 
concatenation. 


         ((A B C) A B C) A B C 


15. While SNOBOL4 is more than just a string language, the 
facilities of the language are geared much more for string 
processing than any other kind. For example, although SNOBOL4 
contains arrays there is no way to automatically sequence 
through an array as one can by pattern matching a string or as 
is possible with APL. Worse, SNOBOL4 does not even contain a 
conventional repetition element like the DO-loop. Also, the 
tracing facilities, while quite useful for strings yield lit- 
tle information with arrays. When accessing strings to do 
fairly complex activities one does not mind paying a small in- 
terpretive overhead since this is a relatively small part of 
the overall computation. But the interpretive overhead of ar- 
ray processing can be several times the cost of accessing the 
array. The net result is that although SNOBOL4 contains ar- 
rays, it is not very good at processing them. One is much 
better off in some other language. Similar remarks may be made 
with perhaps less force about the programmer-defined datatype. 


16. There is some language clutter which could be removed. 
In particular &TRIM, &INPUT and &OUTPUT were introduced into 
the language to overcome implementation inefficiencies of 
MAINBOL. The &ANCHOR keyword invites unstructured programming 
and should be abolished. The VALUE function was a nice idea 
but was defined incorrectly and, in its current form, is 
useless. I know of no serious uses of the SUCCEED pattern but, 
if needed, one could use ARBNO(NULL) were it not for the fact 
that SNOBOL4 attempts to 'protect' you from having a null 
argument to ARBNO. 


A NOT function would have been better. See chapters 6-8 in 
this respect. 


It is hoped that the reader has not by now come to the 
conclusion that SNOBOL4 is an utter abomination. With care and 
foresight many of these deficiencies can not only be overcome 
but turned to advantage. We will see ample evidence of this 
in this and the remaining chapters. It is also the writer's 
hope that this catalog of defects can serve to dispel the no- 
tion that a recognition of a language's strengths is tan- 
tamount to being in love with the language and hence blind to 
its flaws. (This happens frequently but it is not a universal 
phenomenon.) 


Having thusly disposed of the bath water, and assuming that we 
still have our baby, we may proceed to the important topic of: 


Structured Programming 


An unsophisticated programmer, in a surge of programming 
frenzy, will write a large program straight-out over several 
pages which will exhibit no evidence of structure. Such 
programs generally prove to be bitterly difficult to debug and 
modify. Dijkstra [1968] cited the overuse of the goto as one 
of the most flagrant abuses in such run-on programs. 
Willy-nilly transfers of control from one program segment to 
another result in a mangle of spaghetti-like confusion. In 
fact, the abuse has become so great that a controversy has 
arisen over whether the goto 


It is this writer's contention that improper use of the goto 
is a symptom rather than a cause of poor structuring. To 
properly structure a large program it must be decomposed into 
smaller subroutines (or, equivalently, functions, procedures, 
etc.). Subroutinizing reduces the overall size of a program 
since the same section of code may be referred to by several 
different statements. It also allows greater flexibility in 
the writing of a program since it is often unclear at the 
start where an important subactivity will be needed. But the 
most important aspect of subroutinizing is the structure it 
endows the overall program. With reasonably well-defined in- 
terfaces between subroutines, the complexity of a large 
program becomes merely the sum of the complexity of the in- 
dividual component routines, not the product or some higher 
order function. Under such circumstances, the subroutine call 
becomes the primary method of inter-routine transfers of con- 
trol. Intra-routine transfers of control can quite comfortably 
be made with the goto. In fact, many algorithms described in 
a half dozen or so English statements use the goto as a means 
of making more precise that which might otherwise be am- 
biguous. Far from being inherently evil, the goto is a power- 
ful, and the most basic, control element. It is perhaps 
because of this power that it can so easily be abused. 


But whereas we may elect to keep the goto as a control element 
of last resort, it is not generally the best control structure 
for all circumstances. In particular, the IF ... THEN ... ELSE 
... sequence as well as a repetition structure (such as the 
Fortran DO) are ideal in many instances. Their absence in 
SNOBOL4 has led some critics to be unkind to the language. To 
a certain extent the deficiency is real, but is ameliorated 
considerably by what may be called the implicit iteration of 
pattern matching. Thus, the statement: 


         S  ' '  = 


which removes the first blank from the string S contains an 
implicit iteration over the characters of the string S. The 
result is a statement which is considerably easier to under- 
stand than an explicit sequencing. Thus the reason for the 
lack of conventional control structures in SNOBOL4 is that the 
need for them is not felt so acutely. As confirmation of this 
supposition, APL, with its many forms of implicit array itera- 
tion, also lacks the standard control structures (other than 
the goto). 
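
The contrast between explicit sequencing and implicit iteration can be seen by writing the blank-removal operation both ways. The following is a Python sketch offered only as an analogy; the two function names are hypothetical, and str.replace with a count of 1 plays the part of the pattern match with replacement.

```python
def remove_first_blank_explicit(s):
    """Explicit sequencing: walk the characters and test each one."""
    for i, ch in enumerate(s):
        if ch == " ":
            return s[:i] + s[i + 1:]
    return s                      # no blank found; string unchanged


def remove_first_blank_implicit(s):
    """Implicit iteration: the search loop is hidden inside the operation."""
    return s.replace(" ", "", 1)
```

Both functions give the same result for every string; the second, like the SNOBOL4 statement, leaves the loop to the language.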


It would not be correct to conclude that to write large 
programs in SNOBOL4 we subroutinize everything in sight and 
let it go at that. Certain conventions must be followed with 
respect to names of labels, global variables, keywords, etc. 
so that separately written subroutines can co-exist comfor- 
tably. A system of conventions of this kind is followed in 
writing the individual functions in this book so that they 
indeed can be joined together without mutually interfering with 
each other. Many of the routines, in fact, call each other 
and the text processor which produced this book is a rather 
large assemblage (over 3000 statements) of functions which in 
some cases are identical to routines described and in all 
cases were written according to the conventions advocated. 


Conventions 


In order to write well-structured programs in SNOBOL4 it is 
rather more important to establish a system of conventions 
than in other languages. This is because the language does not 
support separately-compiled functions and hence there is a 
potential problem with name conflicts. Another problem has to 
do with mode switches. For example, if we write a function 
which uses pattern matching, we are not generally free to set 
the mode of &ANCHOR. To do so would set the mode of &ANCHOR 
for the calling routine. But how can the called function know 
which setting exists for the &ANCHOR switch? There are only two 
ways out of this dilemma; either the called routine saves the 
old value of &ANCHOR, assigns it a new value, and restores the 
old value before returning, or it makes an assumption as to 
what its value will be 
and all routines live by that assumption. The first method is 
clearly too awkward and is made more odious by the thought 
that we would have to do the same for &FULLSCAN as well. 
Hence, our routines will assume these keywords to contain cer- 
tain values. There are perhaps good reasons to always assume 
&ANCHOR to be on and/or to assume &FULLSCAN to be on, but we 
will abide by the convention that they always have their 
default value of 0 (off). 
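
The first way out of the dilemma, saving and restoring the switch around each use, might look as follows. This is a Python sketch of the idea, not SNOBOL4: the dictionary KEYWORDS and the functions keyword_set, match and starts_with are all hypothetical names standing in for &ANCHOR, &FULLSCAN and a routine that needs an anchored match.

```python
from contextlib import contextmanager

# Hypothetical stand-in for the global mode switches &ANCHOR and &FULLSCAN.
KEYWORDS = {"ANCHOR": 0, "FULLSCAN": 0}


@contextmanager
def keyword_set(name, value):
    """Save a switch, assign it a new value, and restore it afterwards."""
    old = KEYWORDS[name]
    KEYWORDS[name] = value
    try:
        yield
    finally:
        KEYWORDS[name] = old


def match(subject, literal):
    """A toy matcher whose behavior depends on the global ANCHOR switch."""
    if KEYWORDS["ANCHOR"]:
        return subject.startswith(literal)
    return literal in subject


def starts_with(subject, literal):
    """A called routine that needs anchored matching without disturbing its caller."""
    with keyword_set("ANCHOR", 1):
        return match(subject, literal)
```

Every routine that depends on a switch would need the same wrapper, once per switch, which is why the convention of assuming the default values is the less awkward choice.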


It is possible to vary the value of variables having preas- 
signed (pattern) values such as ARB, BAL, FAIL, etc. However, 
it should be obvious that it is poor practice to change these 
values for normal programming. The only exception may be to 
modify ARB (and other patterns) in an upward compatible way 
for debugging purposes. For example, if we set: 


ARB = ARB $ OUTPUT 


at the beginning of the program then every string matched by 
ARB will be printed. Since such a modification only produces 
an upward compatible side-effect, and since the change is only 
temporary, no ill can come of it. 


It is also poor practice to redefine built-in operators and 
functions unless they are done in an upward compatible manner. 
For example, since the SIZE function is not pre-defined for 
array arguments it is not necessarily poor practice to 
redefine the SIZE function so that if the argument is an array 
it will return the number of elements in the array (a function 
which is very possible to write in SNOBOL4). On the other hand 
to redefine SIZE where it is already defined is to produce the 
sort of global change in the language which makes 
subroutinizing difficult. 


How should names be kept separate to avoid collision? Con- 
flicts can occur with names of functions, variables, and 
labels. Since the number of functions is relatively small (a 
few hundred at most) there is generally no problem here. The 
names of functions in this book were generally chosen after 
English words and if this is the case conflicts are readily 
apparent. 


Variable-name conflicts could be a severe problem if one does 
not subroutinize. If one does, the problem virtually disap- 
pears. One simply designates the variables to be temporary to 
some given procedure. If the functions are kept short enough 
no problems arise. It's occasionally necessary to use global 
variables. Here potential conflicts can arise unless one is 
careful. We will use the general policy of designating such 
global names with a name bearing one of the special characters 
'.' or '_'. This tends to reduce the possibility of collision. 
We will typically use the '.' in a pattern name to suggest 
that a variable is being assigned a value. Thus we may write: 


         LEN1.T  =  LEN(1) . T 


and the name becomes a convenient mnemonic. In fact if this 
is not done a strong argument can be made that the use of a 
pre-defined pattern is too obscuring to be used as a general 
programming practice. 


To keep labels from conflicting we will employ the usual prac- 
tice of appending an identifying suffix to some convenient 
root. Thus, for function ALPHA, we can use labels ALPHA_1, 
ALPHA_2, etc. Labels such as LOOP or DONE are obviously poor 
practice except for examples or in a main routine but we al- 
ways shudder a bit when forced to contemplate them. 


We will rely a great deal on the following convention for 
defining functions. The DEFINE function must be executed in 
SNOBOL4 before a function can be defined. For well-structured 
programs, the body of the function should be adjacent to the 
function definition. The function body should not be entered 
other than via a function call. Hence we will use a hop-around 
convention. To define the function ALPHA() we write: 


         DEFINE('ALPHA()') 
         Initialization for ALPHA 
                                              :(ALPHA_END) 
ALPHA 
         Function body of ALPHA 
ALPHA_END 


As indicated here, unless we have special reasons for doing 
otherwise the entry label will be the same as the name of the 
function. Following the call to DEFINE(), we have what is 
called the initialization section, where one may assign values 
to variables, initialize tables, etc. The initialization 
section is especially helpful in SNOBOL4 since for efficiency 
reasons many patterns should be defined 'out-of-line'. The 
ability to perform initializing computations on a per-function 
basis is not generally available in most programming 
languages. Hence, the hop-around technique, which at first 
appears to be a cumbersome apparatus for overcoming a language 
deficiency, becomes a language asset for structuring one's 
programs. 
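
The benefit of a per-function initialization section, building a pattern once and keeping it beside the code that uses it, has a rough analogue in other languages. The sketch below is Python and purely illustrative; the names _LEADING_WORD and leading_word are hypothetical, and a compiled regular expression stands in for a SNOBOL4 pattern defined out-of-line.

```python
import re

# Initialization section: executed once, adjacent to the function it serves.
_LEADING_WORD = re.compile(r"[A-Za-z]+")   # hypothetical out-of-line pattern


def leading_word(s):
    """Return the run of letters at the start of s, or '' if there is none."""
    m = _LEADING_WORD.match(s)             # no pattern is rebuilt on each call
    return m.group(0) if m else ""
```

As with the hop-around convention, the pattern is constructed when the definition is first executed, not on each call, and a reader finds it next to the function body.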


Other conventions are as follows. Although the initial value 
of each variable is the null string, we will not generally use 
this fact. Hence, the initialization section is free to modify 
any variable not used globally (i.e., one whose name does not 
contain one of the special characters '.' or '_'). An 
exception is the variable NULL whose value is never changed. Of 
course any variable which is a temporary variable of a func- 
tion will be automatically assigned the null string before 
function entry and this fact will be used throughout. 


Occasionally a transfer is made to the label ERROR. It is not 
necessarily presumed that a label named ERROR actually appears 
in the source program. If a branch is attempted to some un- 
defined label, the program will halt and an appropriate 
diagnostic will be given. This will indicate where the error 
occurred. It is also helpful in this regard and in general to 
always set &DUMP on (=1) at the start of the program as this 
can provide vital clues as to the source of any error. It is 
easy enough to turn the &DUMP off if the program terminates 
normally.


This chapter covers basic conversions of a kind frequently
needed in a computer environment. We are presenting this
material first, not necessarily because it is the easiest but
because it is relatively unsophisticated. That is, the intent
of a program that does a conversion will probably be clear
even if nothing else is. SNOBOL4 is a good language to
represent conversion algorithms because frequently the objects
converted are strings. This is natural because we are normally
converting between two external representations of the same
thing and the way we represent things externally is most often
via strings of characters.

Program 2.1: UPLO

UPLO is a program for converting all upper case characters
within a string to lower case and vice versa. Thus
UPLO('UPlO') will return 'upLo'. In all cases, characters
which cannot be converted are left unchanged. The program
assumes the IBM 360 EBCDIC encoding of characters [IBM360a;
Appendix F]. There are many uses for such a program owing to
the relative difficulty of keypunching lower case letters and
the growing use of printers with lower case graphics.


*  UPLO(S) will convert upper case to lower case and vice
*  versa. The argument S is an arbitrary string. Nonalphabetic
*  characters are ignored.
*
          DEFINE('UPLO(S)')
*
*  The first problem is to obtain the sequence of lower case
*  letters. This is done by a computation to avoid having to
*  type lower case letters in the program itself. The
*  computation depends on the fact that the upper case letters
*  and the lower case letters are arranged in an identical
*  pattern on the EBCDIC chart. The only difference is that
*  the lower case letters are in the 3rd quadrant (Q3) of
*  &ALPHABET and the uppers are in the 4th quadrant (Q4).
*
          &ALPHABET  LEN(128)  LEN(64) . Q3  LEN(64) . Q4
          UPPERS_ = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
          LOWERS_ = REPLACE(UPPERS_,Q4,Q3)
          UP_LO   = UPPERS_ LOWERS_
          LO_UP   = LOWERS_ UPPERS_
                                                  :(UPLO_END)
*
*  Then the function UPLO merely consists of a call to the
*  REPLACE function.
*
UPLO      UPLO = REPLACE(S,UP_LO,LO_UP)           :(RETURN)
UPLO_END


Epilogue


As discussed in chapter one, we will generally begin a func- 
tion with a call to DEFINE. Following this is the initializa- 
tion section. Here we initialize variables such as UP_LO so 
that subsequent execution is fast. After initialization a 
transfer around the function body is made to a label which is 
normally the function name followed by '_END' (UPLO_END in our 
example). When the function is called, execution normally 
begins at the statement labeled with the same name as the name 
of the function (UPLO in this example). 


The encoding of UPLO depends on the arrangement of characters
in the string &ALPHABET. The characters shown in the box below
are the result of printing &ALPHABET on the printer used to
produce this book.


[Box: printout of the 256 characters of &ALPHABET, one sector
per quadrant, each sector holding two lines of 32 characters.
The third quadrant carries the lower case letters a-i, j-r and
s-z; the fourth carries the upper case letters A-I, J-R and
S-Z in the same relative positions, followed by the digits
0-9.]


In EBCDIC, &ALPHABET contains 256 characters which may be
regarded as consisting of four quadrants of 64 characters
each. In the above, each quadrant is printed in a separate
sector as two lines of 32 characters each. It is easy to see
from this table that the relative positions of the upper and
lower case alphabets in their respective quadrants are the
same. Hence it is possible to obtain the lower case alphabet
from the upper case by a simple replacement.


Although UPLO is character-code dependent, it can easily be
modified for ASCII [ASCII]. In this case, &ALPHABET contains
128 characters whose printing graphics are shown (in order)
below.


      !"#$%&'()*+,-./0123456789:;<=>?
     @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
     `abcdefghijklmnopqrstuvwxyz{|}~


UPLO can be modified to operate with such an &ALPHABET by 
changing five numbers. 
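

For comparison, the same idea can be sketched for ASCII in
Python. This is an illustration only, not the modified
SNOBOL4 program: the names uplo and SWAP_TABLE are invented
here, and the sketch relies on the fact that in ASCII each
lower case letter lies exactly 32 positions above its upper
case counterpart. As in UPLO, the lower case alphabet is
computed rather than typed, and the translation strings are
prepared once.

```python
import string

# Initialization: build the two translation strings once.
UPPERS = string.ascii_uppercase
# In ASCII a lower case letter is 32 code points above its upper case twin.
LOWERS = "".join(chr(ord(c) + 32) for c in UPPERS)
UP_LO = UPPERS + LOWERS
LO_UP = LOWERS + UPPERS
SWAP_TABLE = str.maketrans(UP_LO, LO_UP)


def uplo(s):
    """Swap the case of every letter; other characters are left unchanged."""
    return s.translate(SWAP_TABLE)
```

With these definitions uplo('UPlO') gives 'upLo', matching the
example given for UPLO.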


Program 2.2: BCD_EBCDIC

The transition to the 3rd generation brought with it, for IBM
users, a character conversion problem. The old 6-bit BCD code
was replaced by an expanded 8-bit code. One disadvantage of
the older code was that business and scientific users had
different graphics for the same card code. In particular, the
5 characters #@%<& known only to the business users had the
same card code respectively as ='()+ which were known only to
the scientific user. These two sets diverged in the 3rd
generation. The fortunate business users saw no change, but
the scientific user (such as the FORTRAN programmer) suddenly
found lots of strange characters in his source program.


In such cases one would like to write a program to convert an 
input deck with these 5 commercial characters into the scien- 
tific equivalents. One such program is Program 2.2; it appears 
on one line and in the days when we were converting to 3rd 
generation, I found it convenient to carry such a card on my 
person as a ready answer for anyone wishing to know the 
whereabouts of a program for translating BCD to EBCDIC. 


*  This is a complete program to convert BCD card code to
*  EBCDIC card code. Input cards will be read in, converted,
*  and punched. When no more cards remain the program
*  terminates.
*
L         PUNCH = REPLACE(INPUT,"#@%<&","='()+")   :S(L) ;END


Epilogue 


This is a neat and compact example of the use of the REPLACE
function. A card is read in and any character of the second
argument found in this card is replaced by the corresponding
character in the 3rd argument. The REPLACE function is fast,
proceeding at machine speeds (on the IBM 360-370 a 256-byte
table is set up, after which a single instruction (TR)
translates the entire string [IBM360a]). The REPLACE function
is not only extremely useful for such transliterations but, as
we shall see in the next chapter, can be used for permuting
and rearranging characters as well.
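

The table-driven mechanism just described can be sketched in
Python, whose bytes.translate method also works from a
256-byte table. This is an illustration under stated
assumptions: ASCII bytes stand in for card codes, and the
names make_table, BCD_TO_EBCDIC and convert_card are invented
for the example.

```python
def make_table(old, new):
    """Build a 256-byte table: identity everywhere except old[i] -> new[i]."""
    table = bytearray(range(256))
    for o, n in zip(old, new):
        table[o] = n
    return bytes(table)


# Set up once, as REPLACE sets up its table before translating.
BCD_TO_EBCDIC = make_table(b"#@%<&", b"='()+")


def convert_card(card):
    """Translate one card image (a bytes object) in a single pass."""
    return card.translate(BCD_TO_EBCDIC)
```

For instance, convert_card(b"Y #A&B%C<@") produces
b"Y =A+B(C)'".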


Program 2.3: ROMAN

ROMAN will convert its argument, assumed to be an integer,
into Roman numeral format. Thus, ROMAN(256) returns 'CCLVI'.
Though a classic problem in string manipulation, the reader
may wonder about the utility of such a program (are we going
to use SNOBOL4 to print tombstones?). But there is one common
application in which such an algorithm is essential, viz. a
text formatter which must number pages preceding the first
with Roman numerals. In such cases it is customary to perform
computations (such as adding one for each page) in the normal
Arabic system before converting. In this example, the Roman
numeral would normally appear in lower case. This conversion,
if necessary, can be done using UPLO, Program 2.1.


Although it occasionally happens that we wish to convert from 
Arabic to Roman we almost never want to do the reverse so that 
we will be content here with going in one direction only. 


*  ROMAN(N) will return a string equal to the Roman numeral
*  equivalent of the integer N. N is assumed to be less than
*  4000 and nonnegative.
*
          DEFINE('ROMAN(N)T')                     :(ROMAN_END)
*
*  Entry point: remove the last digit and call it T.
*
ROMAN     N  RPOS(1)  LEN(1) . T  =               :F(RETURN)
*
*  Convert T to its equivalent Roman form. Then append it to
*  the Romanized form of the preceding digits multiplied by
*  10.
*
          '0,1I,2II,3III,4IV,5V,6VI,7VII,8VIII,9IX,'
+              T  BREAK(',') . T                  :F(FRETURN)
          ROMAN = REPLACE(ROMAN(N),'IVXLCDM','XLCDM**') T
+                                                 :S(RETURN)F(FRETURN)
ROMAN_END

Epilogue


The big trick here is to realize that it is relatively easy to
multiply a Roman number by 10 by merely doing a
transliteration of its symbols into the next higher 'octave'.
This is done by REPLACE. Another trick which reduces the size
of the program is to compact a set of information into a long
string and use SNOBOL4's powerful pattern matching to extract
the information.
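

The multiply-by-10 trick is perhaps easiest to appreciate in a
few lines of Python. This is a sketch of the same idea and not
a transcription of ROMAN: the names roman, DIGITS and
TIMES_TEN are invented here, and a list replaces the packed
string of digit forms.

```python
# Roman form of each final digit 0-9.
DIGITS = ["", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"]
# Moving every symbol up one 'octave' multiplies a Roman number by 10.
# D and M have no successor; '*' marks the overflow for inputs of 4000 or more.
TIMES_TEN = str.maketrans("IVXLCDM", "XLCDM**")


def roman(n):
    """Roman numeral for 0 <= n < 4000 (the null string for 0)."""
    if n == 0:
        return ""
    head, last = divmod(n, 10)      # split off the last decimal digit
    return roman(head).translate(TIMES_TEN) + DIGITS[last]
```

Tracing roman(256): roman(2) is 'II', which becomes 'XX' and
gains 'V' to give 'XXV'; that in turn becomes 'CCL' and gains
'VI' to give 'CCLVI'.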


This is not the fastest encoding of ROMAN. There was no effort
to economize on time because it may be presumed that the use
of ROMAN is infrequent. If anything, an effort was made to
reduce the size of the program in order to minimize storage
consumption. This is good practice for seldom-used code.


Programs 2.4 & 2.5: BASEB & BASE10

The decimal system in common use to represent numbers is a
positional system, meaning that the value of a digit depends
on its position.


Generally, in a positional number system, the numeral

     $$a_1 a_2 \ldots a_n$$

represents the number

     $$a_1 B^{n-1} + a_2 B^{n-2} + \cdots + a_n$$


where B is some integer called the base. The decimal system
uses B = 10. A positional system can represent arbitrarily
large quantities with only a finite number (equal to B) of
symbols. This is in contrast to the Roman numbers where the
value of a symbol depends on the symbol itself and not on its
position. Hence, for arbitrarily large numbers, we need
arbitrarily many symbols.


Though our current decimal system was introduced in Europe by 
the Arabs in the 9th Century, the system did not flourish 
there until the 16th Century Spanish merchants were humiliated 
by the arithmetic prowess of the stone-age Mayan Indians who 
were using a base 20 positional system. See Von Hagen [1960].


The growth of computer systems in which base 2 arithmetic is 
used internally to represent numeric quantities has drawn at- 
tention to the representation of numbers in various bases and 
has led to the need in many cases to convert from one base to 
another. 


In this section we include two routines for base conversion.
BASEB(N,B) will convert integer N into its representation in
base B. Thus, BASEB(15,3) will return '120' as this is the
base 3 representation of 15. Conversely, BASE10(N,B) will
convert the numeral N in base B to the equivalent decimal
number. Thus BASE10('120',3) will return '15'. This is
customarily written

     $$(120)_3 = 15$$

where the absence of an explicit base indication implies base
10.


To convert N from base $b_1$ to base $b_2$ we could combine the
functions thus:

          BASEB(BASE10(N,b1),b2)
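

As a cross-check on the definitions, the two conversions and
their composition can be sketched in Python. The names baseb,
base10 and ALPHA are chosen for this illustration; N is
assumed nonnegative and no check is made that each digit is
valid for the base.

```python
ALPHA = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def baseb(n, b):
    """Representation of the nonnegative integer n in base b (2 <= b <= 36)."""
    digits = ""
    while n != 0:
        n, r = divmod(n, b)          # r is the least significant digit
        digits = ALPHA[r] + digits   # tack it onto the front of the result
    return digits                    # note: the null string when n is 0


def base10(numeral, b):
    """Value of the string numeral, read as a number written in base b."""
    value = 0
    for ch in numeral:
        value = value * b + ALPHA.index(ch)
    return value
```

Thus baseb(15, 3) gives '120', base10('120', 3) gives 15, and
baseb(base10('FF', 16), 2) converts hexadecimal FF to the
binary numeral '11111111'.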


The characters used to indicate digits higher than 9 are the
letters of the alphabet with A equal to 10, B equal to 11,
etc. This seems to be the most common method of denoting the
higher digits. On the other hand, there are dissenters who
say that this encoding is unnatural in that the even letters
(B, D, F, etc.) correspond to odd numbers (11, 13, 15, ...)
whereas the odd letters (A, C, E, ...) correspond to even
numbers (10, 12, 14, ...). These people might prefer the
letters 'XABC...' rather than 'ABC...'. Another method might
be to use some arbitrary sequence from the end of the alphabet
such as 'UVWXYZ' rather than 'ABCDEF'. In either case, the
functions BASEB and BASE10 can be modified to suit by changing
the value of the global variable BASEB_ALPHA.


*  BASEB(N,B) will convert the integer N to its base B
*  representation. B may be any positive integer <= 36.
*
          DEFINE('BASEB(N,B)R,C')
          BASEB_ALPHA = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                                                  :(BASEB_END)
*
*  Entry point and top of loop: If N is zero we are done
*
BASEB     EQ(N,0)                                 :S(RETURN)
*
*  Obtain the base-B representation (C) of the least
*  significant digit of N.
*
          R = REMDR(N,B)
          BASEB_ALPHA  TAB(*R)  LEN(1) . C        :F(ERROR)
*
*  Tack result onto previous value, update N and loop.
*
          BASEB = C BASEB
          N = N / B                               :(BASEB)
BASEB_END


*  BASE10(N,B) will convert the string N assumed to be a
*  numeral expressed in base B arithmetic to decimal (base
*  10).
*
          DEFINE('BASE10(N,B)T')
          BASEB_ALPHA = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                                                  :(BASE10_END)
*
*  Entry point and top of loop. Find first digit in N and
*  determine its value in base 10.
*
BASE10    N  LEN(1) . T  =                        :F(RETURN)
          BASEB_ALPHA  BREAK(*T)  @T              :F(ERROR)
*
*  Then use standard conversion algorithm for converting to
*  base 10.
*
          BASE10 = (BASE10 * B) + T               :(BASE10)
BASE10_END


Epilogue


In BASEB, the search for the representation of the Rth
character is done using the pattern

          TAB(*R) LEN(1) . C

This pattern is identical in effect to the pattern

          TAB(R) LEN(1) . C

Strangely enough, the former is faster in SPITBOL. This is
because TAB(*R) LEN(1) . C is a constant valued pattern and
can be pre-evaluated, whereas the same pattern without the '*'
is not constant. It requires more time, in general, to form
the pattern than it does to do the pattern match so that much
has been gained. A similar remark can be made about the
pattern matching statement involving BREAK(*T) immediately
following label BASE10.


In SNOBOL4, similar considerations apply except that the 
programmer must pre-evaluate his own expressions; the compiler 
will not do it for him. Thus 


          CONVERT_R = TAB(*R) LEN(1) . C

          BASEB_ALPHA  CONVERT_R


would yield a more efficient rendition, in SNOBOL4, of the 
function BASEB. This is recommended if speed is of importance. 
The pattern CONVERT_R could be defined in the initialization 
section of the function thereby keeping the pattern associated 
with the function. But note that 


          CONVERT_R = TAB(R) LEN(1) . C

          BASEB_ALPHA  CONVERT_R


would not be valid because the pattern CONVERT_R would be 
using the value of R at the time of assignment and not at the 
time of the pattern match. 
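

The distinction between the two forms is that between a value
computed once and an expression evaluated on demand. A loose
Python analogy (not SNOBOL4, and with the names eager_digit
and deferred_digit invented for the purpose) may make the
point concrete:

```python
ALPHA = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

r = 0

# Like CONVERT_R = TAB(R) ...: r is consulted once, when the value is built.
eager_digit = ALPHA[r]


# Like CONVERT_R = TAB(*R) ...: r is consulted each time the lookup is made.
def deferred_digit():
    return ALPHA[r]


r = 11   # r changes after both definitions have been made
```

After the final assignment eager_digit is still '0', whereas
deferred_digit() yields 'B'. The pre-built pattern must use
the deferred form for exactly this reason.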


We will not always use a deferred form such as TAB(*R) but 
will generally prefer TAB(R). This is simpler and is not im- 
plementation dependent. It is always easy enough to modify 
the function so that a pattern is not continually being 
generated. Choosing the path of least resistance, as we will 
tend to do, has another advantage. For those programs for 
which space is more important than time, pre-defining the pat- 
tern is actually less efficient for the pattern must then 
occupy space continuously and not merely when it is needed.


Program 2.6: HEX

To a human being a character is some geometric configuration,
but to a machine it is just a sequence of bits. On the IBM
360-370 series machines, a character is a sequence of 8 bits.
For example, the pattern of bits representing the letter A is

          11000001

It is obviously more convenient to write these 8 bits in base
16 notation so that A comes out looking like

          C1

HEX(S) is a function which will accept a string of characters
and return a string of hexadecimal digits representing its
internal representation. Thus

          HEX('ABA')

returns 'C1C2C1'.


All characters have an 8-bit code and all 8-bit codes
represent some character, but not all characters are
printable. Thus the SNOBOL4 keyword &ALPHABET is a string of
all the 8-bit characters starting with 00000000 and going on
up to 11111111 (in numerical order). If this string were to be
printed (as we did earlier) most of the characters would
appear blank. The graphical image printed is a function of the
printer. The IBM 1403 printer has room for at most 240
graphics. Moreover, to increase printing speed there are many
duplications of the more frequently appearing characters. The
net result is that there are seldom more than 100 graphics in
&ALPHABET. Thus, an important use of HEX is for processing
data which is not character oriented and is therefore not
easily dealt with in terms of characters. For example, suppose
we wish to scan the input text for 2 consecutive occurrences
of the hexadecimal constant 50. Then the following statement
would perform the scan


          HEX(INPUT)  POS(0)  ARBNO(LEN(2))  '5050'


*  HEX(S) will return the hexadecimal (internal) representation
*  of the string S.
*
          DEFINE('HEX(S)')
*
*  Prepare tables of the 1st and 2nd hex digits.
*
          H = '0123456789ABCDEF'
          HEX_2ND = DUPL(H,16)
HEX_1     H  LEN(1) . T  =                        :F(HEX_END)
          HEX_1ST = HEX_1ST DUPL(T,16)            :(HEX_1)
*
*  Entry point: Form the first and second digits separately
*  and then blend them.
*
HEX       HEX = BLEND(REPLACE(S,&ALPHABET,HEX_1ST),
+                     REPLACE(S,&ALPHABET,HEX_2ND))      :(RETURN)
HEX_END


Names referenced      Name      Type        Where defined
by HEX:               BLEND     Function    Program 3.7


Epilogue


We have taken an unusual approach in encoding HEX. It might 
seem at first that it would be better to prepare some table 
which would yield the correct pair of characters for every 
character in the &ALPHABET. But we have already noted how fast
REPLACE can be so that we can obtain either hex digit ex- 
tremely quickly. The question remains as to how we may swiftly 
merge the 2 character sequences. This we do by the program 
BLEND (Program 3.7) which merges 2 equi-length strings. As we 
shall see, BLEND also uses the REPLACE function in an unob- 
vious way and is quite rapid. 
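

The two-table approach can be sketched in Python as follows.
This is an illustration only: hex_of and blend are names
invented here, blend is a simple stand-in for BLEND (Program
3.7) and not its actual method, and the input is a bytes
object so that each element is an 8-bit code.

```python
H = "0123456789ABCDEF"
# For each of the 256 codes, its first (high) and second (low) hex digit.
HEX_1ST = "".join(d * 16 for d in H).encode("ascii")
HEX_2ND = (H * 16).encode("ascii")


def blend(a, b):
    """Interleave two equal-length strings: a1 b1 a2 b2 ..."""
    return "".join(x + y for x, y in zip(a, b))


def hex_of(data):
    """Hexadecimal representation of a bytes object, two digits per byte."""
    first = data.translate(HEX_1ST).decode("ascii")
    second = data.translate(HEX_2ND).decode("ascii")
    return blend(first, second)
```

Applied to the three EBCDIC codes for 'ABA', that is
hex_of(bytes([0xC1, 0xC2, 0xC1])), the result is 'C1C2C1'.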


Program 2.7: CH

CH(H) will take a string of hexadecimal digits (H) and convert
them to the corresponding character sequence. Thus CH('C1C2')
will return 'AB'. CH is the inverse of HEX so that
CH(HEX(S)) = S. The conversion provided by CH can be useful
for obtaining characters that can be printed but not typed.
Thus CH('818283') returns 'abc'.


*  CH(HEX) will convert the sequence of hexadecimal digits
*  into the corresponding character string. CH is the inverse
*  of HEX.
*
          DEFINE('CH(HEX)T,C,N')
                                                  :(CH_END)
*
*  Entry point: Remove 2 characters from string HEX. Then
*  convert to decimal (using BASE10) and retrieve the indexed
*  character from the &ALPHABET.
*
CH        HEX  LEN(2) . T  =                      :F(RETURN)
          C = BASE10(T,16)
          &ALPHABET  LEN(C)  LEN(1) . C
          CH = CH C                               :(CH)
CH_END


Names referenced      Name      Type        Where defined
by CH:                BASE10    Function    Program 2.5


Epilogue


The method used to program CH is to treat each pair of hex- 
adecimal characters as a number in base 16. This number can 
be converted to decimal using BASE10 (Program 2.5). This 
decimal number can then be used to index into the keyword
&ALPHABET.
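

The same method can be sketched in Python, where the built-in
int(text, 16) plays the part of BASE10 and a bytes object
plays the part of the character string. The name ch is chosen
for this illustration; as in CH, an odd digit left over at the
end is ignored.

```python
def ch(hex_digits):
    """Convert a string of hex digit pairs to the corresponding bytes."""
    out = bytearray()
    # Take the digits two at a time, as CH removes LEN(2) on each pass.
    for i in range(0, len(hex_digits) - 1, 2):
        code = int(hex_digits[i:i + 2], 16)   # the pair as a base 16 number
        out.append(code)                      # index into the 256 codes
    return bytes(out)
```

For example ch('C1C2') returns the two codes 0xC1 and 0xC2,
which are 'AB' in EBCDIC.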


Program 2.8: DAY

DAY will return the day of the week given some date. Thus
DAY('3/24/71') will return 'WEDNESDAY', and DAY(DATE()) will
return the current day. As an added bonus, the global variable
D will be set to an integer between 0 and 6 inclusive to give
a numeric indication of the day. If a year other than one from
the 20th century is intended then a 4-digit year must be given
as in DAY('3/24/1825'). If the year is missing, the current
year is assumed. Thus:


          'CHRISTMAS FALLS ON ' DAY('12/25') ' THIS YEAR.'


will be a semantically correct string when evaluated, no
matter in what year it is evaluated.


The program assumes the Gregorian Calendar and will accept
any date from the 2nd century onward (i.e. after 100 A.D.).
The extrapolation into the time period before the Gregorian
calendar went into effect (1582), however, will not agree with
historical records.


It is interesting to note that the revision of the calendar 
followed on the heels of the discoveries of Indian civiliza- 
tions in the New World whose elaborate and involved calendrics 
are said to be even more accurate than our present Gregorian 
calendar (see Morley [1956] for example). 


*  DAY(DATE) will return the day of the week appropriate to
*  the given DATE. DATE is given as month/day/year.
*
          DEFINE('DAY(DATE)M,Y')
*
*  YEAR_ is the number of days in a year. YEAR_4, CENT_ and
*  CENT_4 are the number of days in the cyclic time periods
*  of respectively 4 years, a century and 4 centuries.
*
          YEAR_ = 365
          YEAR_4 = 4 * YEAR_ + 1
          CENT_ = (25 * YEAR_4) - 1
          CENT_4 = 4 * CENT_ + 1
          DAY_ZERO = 2
                                                  :(DAY_END)
*
*  First extract the month, day, and year. If the year is
*  null the current year (obtained from DATE) is used. Then
*  '19' is prepended if the year is only 2 characters long.
*
DAY       DATE  BREAK('/') . M  LEN(1)
+              (BREAK('/') . D  LEN(1)  REM . Y  |  REM . D)
          (IDENT(Y,'') DATE())  '/'  ARB  '/'  REM . Y
          Y = EQ(SIZE(Y),2) '19' Y
*
*  The number of days since March 0, 0000 will be computed.
*  First compute the number of whole months and the number of
*  whole years since that date.
*
          M = LE(M,2) M + 12                      :F(DAY_1)
          Y = Y - 1
DAY_1     M = M - 3
*
*  Now add an appropriate number of days for each cyclic year
*  period. Note: integer divided by integer yields integer.
*
DAY_2     DAY = (Y / 400) * CENT_4 + (REMDR(Y,400) / 100) * CENT_
+               + (REMDR(Y,100) / 4) * YEAR_4 + REMDR(Y,4) * YEAR_
*
*  Now add an appropriate amount for the month (note that 153
*  is the number of days in a 5-month period), the day, and
*  an initializing constant. This value is taken modulo 7
*  and a search is made based on that value.
*
          DAY = DAY + ((153 * M) + 2) / 5 + D + DAY_ZERO
          D = REMDR(DAY,7)
          '0SUN1MON2TUES3WEDNES4THURS5FRI6SATUR7'
+              D  BREAK('01234567') . DAY
          DAY = DAY 'DAY'                         :(RETURN)
DAY_END


Epilogue


This program was modified for SNOBOL4 from an Algol program by 
Tantzen [1963]. His version is slightly more efficient and we 
leave this refinement as an exercise. 


The program is done by a computation; it could also have been
done by a look-up procedure in which a string might contain a
month-day sequence in which the proper number of days is
associated with each month. In general, this would have been
easier and less error-prone but would not have been as
efficient.


A very clever scheme is used to obtain the number of days that
a given month is worth. It is recognized that if we start in
March, the number of days per month is given by the sequence
31 30 31 30 31 which repeats itself for effectively the
remainder of the March - March year. The computation:

     $$\frac{153 \cdot M + 2}{5}$$

is so calculated as to yield, under integer division,
precisely the correct number of days.
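

The arithmetic can be checked with a short Python sketch. It
follows the same computation as DAY but is an illustration
only: the function name day and its three integer arguments
are choices made here (the date parsing and the
default-to-current-year behavior are omitted), and // is
Python's integer division.

```python
YEAR_ = 365
YEAR_4 = 4 * YEAR_ + 1          # days in 4 years
CENT_ = 25 * YEAR_4 - 1         # days in a century
CENT_4 = 4 * CENT_ + 1          # days in 4 centuries
DAY_ZERO = 2
NAMES = ["SUN", "MON", "TUES", "WEDNES", "THURS", "FRI", "SATUR"]


def month_days(m):
    """Days in the m whole months after March 1 (m = 0 for March)."""
    return (153 * m + 2) // 5


def day(month, dom, year):
    """Day of the week for a Gregorian date given as three integers."""
    if month <= 2:              # January and February belong to the
        month += 12             # March-to-March year that began earlier
        year -= 1
    month -= 3
    days = ((year // 400) * CENT_4 + (year % 400 // 100) * CENT_
            + (year % 100 // 4) * YEAR_4 + (year % 4) * YEAR_)
    days += month_days(month) + dom + DAY_ZERO
    return NAMES[days % 7] + "DAY"
```

For m = 0, 1, ..., 11 the month formula gives 0, 31, 61, 92,
122, 153, 184, 214, 245, 275, 306, 337, which are the running
totals of 31 30 31 30 31 31 30 31 30 31 31 beginning with
March. As a check on the whole computation, day(3, 24, 1971)
returns 'WEDNESDAY'.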


Program 2.9: MDY

MDY(Y,D) will convert a year,day date into a month/day/year
date. For example MDY(71,83) will return '3/24/71'. The global
variables M and D are set to equal the month and day
respectively. MDY is useful in an environment where the system
computes days but not months (such as OS 360).


*  MDY(Y,D) will convert its argument which is given as year,
*  day into month/day/year format.
*
          DEFINE('MDY(Y,DY)X,T')
*
*  Set up 2 tables to be searched. One showing cumulative
*  days vs. month (DAY_MONTH) for normal years and one for
*  leap years (LY_DAY_MONTH).
*
          DAY_MONTH = '(334,12)(304,11)(273,10)(243,9)'
+            '(212,8)(181,7)(151,6)(120,5)(90,4)(59,3)(31,2)(0,1)'
          LY_DAY_MONTH = '(335,12)(305,11)(274,10)(244,9)'
+            '(213,8)(182,7)(152,6)(121,5)(91,4)(60,3)(31,2)(0,1)'
*
*  Set up a pattern to search the tables.
*
          I = SPAN('0123456789')
          SEARCH.X.M = '('  I $ X  *GT(DY,X)  ','  I $ M     :(MDY_END)
*
*  Entry point: Set up the proper table in T. Use leap year
*  table if Y is either (divisible by 400) or (divisible by 4
*  but not 100).
*
MDY       T = EQ(REMDR(Y,400),0) LY_DAY_MONTH     :S(MDY_1)
          T = EQ(REMDR(Y,100),0) DAY_MONTH        :S(MDY_1)
          T = EQ(REMDR(Y,4),0) LY_DAY_MONTH       :S(MDY_1)
          T = DAY_MONTH
*
*  Then search the table for the current month (M) and the
*  number of days (X) associated with that month. Fail if DY
*  is not a valid day.
*
MDY_1     T  SEARCH.X.M                           :F(FRETURN)
          D = DY - X
          GT(D,31)                                :S(FRETURN)
          MDY = M '/' D '/' Y                     :(RETURN)
MDY_END

Epilogue 


We have written this program in terms of a 'table-look-up'
procedure (actually string look-up would be more correct). But
we could have done this by computational methods by turning
the DAY function around and 'pointing it backward'. This we
invite the reader to try as an exercise.
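

The look-up procedure itself can be sketched in Python with
the same two tables held as lists of (cumulative days, month)
pairs. This is an illustration of the search, not a
transcription of MDY: the name mdy is chosen here and failure
is signalled by returning None.

```python
DAY_MONTH = [(334, 12), (304, 11), (273, 10), (243, 9), (212, 8), (181, 7),
             (151, 6), (120, 5), (90, 4), (59, 3), (31, 2), (0, 1)]
LY_DAY_MONTH = [(335, 12), (305, 11), (274, 10), (244, 9), (213, 8), (182, 7),
                (152, 6), (121, 5), (91, 4), (60, 3), (31, 2), (0, 1)]


def mdy(y, dy):
    """Convert year y and day-of-year dy to 'month/day/year', or None."""
    leap = y % 400 == 0 or (y % 4 == 0 and y % 100 != 0)
    table = LY_DAY_MONTH if leap else DAY_MONTH
    # The tables run from December down to January, so the first entry
    # whose cumulative count is below dy identifies the month.
    for x, m in table:
        if dy > x:
            d = dy - x
            if d > 31:          # dy runs past the end of December
                return None
            return "{}/{}/{}".format(m, d, y)
    return None                 # dy was zero or negative
```

Thus mdy(71, 83) gives '3/24/71': the first entry with fewer
than 83 cumulative days is (59, 3), and 83 - 59 = 24.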


Program 2.10: SPELL

SPELL(N) will return an English phrase designating the integer
N. Thus SPELL(13) will return 'THIRTEEN'. SPELL will convert
all integers from 0 to 999999999 (a thousand million - 1).
SPELL can easily be extended to handle larger ranges; see
Exercise 2.16. One obvious application of SPELL is in writing
checks.


          DEFINE('SPELL(N)M')                     :(SPELL_END)
*
*  Entry Point: Fan out to one of several labels depending
*  on the value of N.
*
SPELL     GE(N,1000)                              :S(SPELL_1000)
          GE(N,100)                               :S(SPELL_100)
          GE(N,20)                                :S(SPELL_20)
          GE(N,13)                                :S(SPELL_13)
*
*  Here if N is 12 or less; look its value up in a table.
*
          ('1ONE,2TWO,3THREE,4FOUR,5FIVE,6SIX,7SEVEN,8EIGHT,9NINE,'
+          '10TEN,11ELEVEN,12TWELVE,')  N  ARB . SPELL  ','    :(RETURN)
*
*  Here to do the teens. It will be simpler to do the tens
*  version and substitute 'TEEN' for 'TY' afterward.
*
SPELL_13  N  1  LEN(1) . M
          SPELL = SPELL(M 0)
          SPELL  'TY'  =  'TEEN'
          SPELL  'FOR'  =  'FOUR'                 :(RETURN)
*
*  Here to handle all compounds from 20 through 99. Just look
*  up the root in a table and add the suffix 'TY'. Then call
*  SPELL recursively to handle the units.
*
SPELL_20  N  LEN(1) . M  =
          '2TWEN,3THIR,4FOR,5FIF,6SIX,7SEVEN,8EIGH,9NINE,'
+              M  BREAK(',') . SPELL
          SPELL = SPELL 'TY'
          SPELL = NE(N,0) SPELL '-' SPELL(N)      :(RETURN)
*
*  Hundreds are handled by converting the hundreds and tens
*  recursively.
*
SPELL_100 N  LEN(1) . M  =
          SPELL = SPELL(M) ' HUNDRED'
          SPELL = NE(N,0) SPELL ' AND ' SPELL(N)  :(RETURN)
*
*  For numbers over 1000, remove all but the last three
*  digits of N assigning them to M. Convert M, 'multiply' it
*  by 1000 and 'add' N.
*
SPELL_1000
          N  RTAB(3) . M  =
          SPELL = SPELL(M)
          SPELL  'THOUSAND'  =  'MILLION'
          SPELL = SPELL ' THOUSAND'
          SPELL = NE(N,0) SPELL ' AND ' SPELL(N)  :(RETURN)
SPELL_END
Epilogue 


SPELL was written to be small rather than fast and uses recur- 
sion quite liberally and effectively to render a smaller and 
more readable program. 
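

The recursive structure, and in particular the trick of
deriving the teens from the tens, can be sketched in Python.
This is an independent illustration restricted to 1 through
999 to keep it short; it is not a transcription of SPELL, and
the names spell, SMALL and ROOTS are invented here.

```python
SMALL = ["", "ONE", "TWO", "THREE", "FOUR", "FIVE", "SIX", "SEVEN",
         "EIGHT", "NINE", "TEN", "ELEVEN", "TWELVE"]
ROOTS = {2: "TWEN", 3: "THIR", 4: "FOR", 5: "FIF",
         6: "SIX", 7: "SEVEN", 8: "EIGH", 9: "NINE"}


def spell(n):
    """English phrase for an integer in the range 1 through 999."""
    if n >= 100:
        hundreds, rest = divmod(n, 100)
        text = spell(hundreds) + " HUNDRED"
        return text + " AND " + spell(rest) if rest else text
    if n >= 20:
        tens, units = divmod(n, 10)
        text = ROOTS[tens] + "TY"
        return text + "-" + spell(units) if units else text
    if n >= 13:
        # Teens: spell the tens version, then turn 'TY' into 'TEEN'.
        text = spell((n - 10) * 10).replace("TY", "TEEN")
        return text.replace("FOR", "FOUR")   # FORTEEN -> FOURTEEN
    return SMALL[n]
```

Here spell(14) first forms 'FORTY', then 'FORTEEN', and
finally 'FOURTEEN'; spell(256) gives
'TWO HUNDRED AND FIFTY-SIX'.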


EXERCISES


Exercise 2.1

Using strings prepared in the initialization section of UPLO
write a function UP() which will convert any lower case in its
argument to upper case.


Exercise 2.2

Given the function UPLO() and a function UP() which converts
lower case to upper case, write a function LO() which converts
upper case to lower case.


Exercise 2.3

Given a paragraph in P assumed keypunched in upper case, use
UPLO to convert P into lower case except that the first
character of every sentence should remain capitalized. The
first nonblank character is regarded as the beginning of the
first sentence. Subsequent sentences are marked by a period
followed by at least 2 blanks. (This requires only two
statements.)


Exercise 2.4

Write a function (ARABIC) to convert a number in the Roman
representation to one in standard (base 10) notation.


Exercise 2.5

Let $\lceil x \rceil$ be the smallest integer $\ge$ the real
number x (sometimes referred to as the ceiling of x). Thus

     $$\lceil 1.5 \rceil = 2$$
     $$\lceil 2.0 \rceil = 2$$
     $$\lceil -9.5 \rceil = -9$$

With the help of functions defined in this section write
SNOBOL4 expressions equivalent to

     $$\lceil \log_2 K \rceil$$
     $$\lceil \log_n K \rceil$$

where K and n are positive integers.
| a i Bre | ; 
{| Exercise 2.6 | The Mayan Indians used a base 20 positional 


L__———-_————-——-J_ number system. The figures for the digits 0 
thru 19 were built up systematically as in the table below. 


     Arabic    Mayan          Arabic    Mayan
     form      equiv          form      equiv
       0       0                10      ||
       1       .                11      ||.
       2       ..               12      ||..
       3       ...              13      ||...
       4       ....             14      ||....
       5       |                15      |||
       6       |.               16      |||.
       7       |..              17      |||..
       8       |...             18      |||...
       9       |....            19      |||....


Hence the number 752 would be represented as 


Here the digits are run from left to right in descending
significance whereas the Mayans would align their digits
vertically. Also the dots ran in a direction orthogonal to the
bars. One has a great deal more freedom in these matters if
one is merely carving the figures out of stone.

The exercise is, given the integer N, write a loop to convert
N to its Mayan form. This can be done in 4 statements (without
using the functions defined in this chapter).


Exercise 2.7  A hypothetical machine has a word size of 32
bits represented as b1 b2 ... b32. The bits have the following
meaning when representing floating point.

     S: b1 (sign)        0: positive, 1: negative
     E: {b2 ... b11}     exponent of 2 in excess 1024 notation
     F: {b12 ... b32}    fractional part with decimal point to
                         the left of b12.

Hence a floating point number will have the value:

          (-1)^S * (F / 2^21) * 2^(E-512)

Write a function (using the base conversion algorithms) to
convert an eight-hexadecimal-digit machine word into a
floating point number.


Exercise 2.8  Extend the routines BASEB and BASE10 to handle
decimal points. Assume a global cell PRECISION which will hold
the number of digits of precision required in the fraction.
Allow BASEB and BASE10 to call themselves recursively.


Exercise 2.9  What statements would have to be modified if
BASEB and BASE10 were to be extended to unlimited-precision
arithmetic?


Exercise 2.10  Let Y, N and M be integers.

a) Show that:

          REMDR(Y, N*M) / N  =  (Y / N) - (Y / (M*N)) * M

and hence that line labeled DAY_2 in Program 2.8 can be
rewritten:

DAY_2     DAY  =  (Y / 400) * K1  +  (Y / 100) * K2
+                 +  (Y / 4) * K3  +  Y * K4

where K1, K2, K3, K4 are values which can be precomputed.

b) Compute K1, K2, K3, K4.


Exercise 2.11  Suppose there are 64 characters in &ALPHABET.
Rewrite HEX so that it returns the base-8 representation of a
string. Call the function OCTAL.


Exercise 2.12  In writing a compiler it is sometimes necessary
to manipulate bits since the instruction is formed as a
sequence of bits.

a) Set the Nth bit of a string S to 1. Assume the bits are
numbered starting with 0 and ending with 8 * SIZE(S) - 1.
(This assumes 8 bits per character.)

b) Invert the Nth bit of a string S.


Exercise 2.13  Using DAY, determine whether a given date is
valid. For example, 2/29/1973 is invalid.

Exercise 2.14  Using DAY, write a program which prints a
calendar for the month M and year Y.


Exercise 2.15  Given that the number of days since March 0 is
(153*M+2)/5 where M is the number of whole months since that
date, write an expression for the number of whole months given
the number of days. Using this formula rewrite MDY as a
computation.


Exercise 2.16  Assuming that a billion is a thousand million,
add a single statement to SPELL to increase the range of
convertible numbers to a thousand billion - 1.


Exercise 2.17  In the U.S. the terms billion, trillion,
quadrillion, quintillion, sextillion, septillion and octillion
refer to the numbers 1000 million, 1000^2 million, 1000^3
million, ..., 1000^7 million respectively whereas in Great
Britain these terms refer respectively to million^2,
million^3, million^4, ..., million^8. Extend SPELL so that it
will convert its argument up to the octillions in the British
system. Note that SNOBOL4 integers don't go that high so
assume the input is string and don't use arithmetic operators
(like GE) on anything too big.


Exercise 2.18  Pick a number; count the letters in its
spelled-out form and you produce a new number. For example 13
is spelled 'THIRTEEN' and hence transforms into 8. This
transformation has the interesting property that its repeated
application will cause every number to converge rapidly to 4.
For example, starting with 13, the sequence

          13  8  5  4  4  4  4  ...

is produced. Write a program to determine the smallest integer
between 0 and 10000 which requires the most steps to converge
to 4 (the integer is 113 and it requires 6 steps).

Exercise 2.19  The musical scale is given by the following
sequence of 12 notes.

          C  C#  D  D#  E  F  F#  G  G#  A  A#  B

Given a number N between 1 and 12, write a single
pattern-matching statement to assign the Nth note (a one or
two character string) to the variable NOTE.


SNOBOL4 represents strings by a pointer to string storage. One
of the consequences of this storage management philosophy is
that the cost of string assignment is relatively low. That is,
it costs very little to interchange string values among
variables. In particular it is relatively inexpensive to pass
string values to and from functions.


The functions presented in this chapter all are fairly short 
utility-like functions which operate primarily with strings. 
We will see most of these functions later in the book where 
they will serve as lemma-like procedures to make larger 
programs more understandable. 


Program 3.1  ORDER

ORDER(S) will return an alphabetized version of its argument
S. Thus, ORDER('ORDER') will return 'DEORR'. The alphabetic
ordering of characters is determined, as usual, by &ALPHABET.
To modify the ordering produced by ORDER the statement
containing this keyword should be replaced. ORDER, as we will
see, has many uses. For example, it furnishes an easy way to
check for set equality.


*  ORDER(S) will put the characters of its argument in
*  alphabetic order.
          DEFINE('ORDER(S)T,HIGHS,S1')             :(ORDER_END)
*  Entry point: Extract a character (T) from S; obtain (in
*  HIGHS) characters alphabetically >= the extracted
*  character.  Then scan ORDER for the first occurrence of one
*  of these higher characters.
ORDER     S  LEN(1) . T  =                         :F(RETURN)
          &ALPHABET  BREAK(T)  REM . HIGHS
          ORDER  (BREAK(HIGHS) | REM) . S1  =  S1 T
+                                                  :(ORDER)
ORDER_END
Epilogue 


ORDER is essentially a sorting routine and as such it is an 
insertion sort. Characters are extracted one at a time from 
the argument S and are inserted in order into the growing 
string ORDER. 
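
The insertion step can also be sketched outside SNOBOL4. The Python fragment below is only an illustrative model and is not part of the original program: the name order and the optional alphabet argument (standing in for &ALPHABET) are assumptions made for the example.

```python
def order(s, alphabet=None):
    """Insertion sort of the characters of s.

    alphabet plays the role of &ALPHABET; when omitted, the
    character codes of the host machine decide the ordering.
    """
    rank = ord if alphabet is None else alphabet.index
    result = ""
    for t in s:
        # Skip every character already placed that ranks below t ...
        i = 0
        while i < len(result) and rank(result[i]) < rank(t):
            i += 1
        # ... and insert t in front of the first one that does not.
        result = result[:i] + t + result[i:]
    return result


print(order("ORDER"))                       # DEORR
print(order("LISTEN") == order("SILENT"))   # True: same characters
```

Comparing the ordered forms of two strings, as in the last line, is the kind of equality check alluded to above.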
Programs 3.2 & 3.3  LPAD & RPAD

(available in SPITBOL and SITBOL) LPAD and RPAD are useful in
formatting line output. They are patterned after the built-in
functions in SPITBOL and are included here for use with
SNOBOL4. LPAD will pad on the left to fill out a string to the
required field width and RPAD will pad on the right. Thus


OUTPUT = RPAD(S1,60) LPAD(S2,60) 


will place string S1 on the left and string S2 on the extreme 
right of a computer printout page that happens to be 120 
characters wide. Both functions may be called with a 3rd ar- 
gument to indicate a pad character other than a blank. 


*  LPAD(S,N,C) will pad string S on the left with character
*  C until the string is N characters long.  S is returned if
*  it is >= N characters long.  C is taken to be ' ' if
*  unspecified.
          DEFINE('LPAD(S,N,C)')                    :(LPAD_END)
LPAD      LPAD  =  GE(SIZE(S),N)  S                :S(RETURN)
          C  =  IDENT(C)  ' '
          LPAD  =  DUPL(C, N - SIZE(S))  S         :(RETURN)
LPAD_END


*  RPAD(S,N,C) pads on the right rather than on the left but
*  its behaviour is otherwise the same as LPAD.
          DEFINE('RPAD(S,N,C)')                    :(RPAD_END)
RPAD      RPAD  =  GE(SIZE(S),N)  S                :S(RETURN)
          C  =  IDENT(C)  ' '
          RPAD  =  S  DUPL(C, N - SIZE(S))         :(RETURN)
RPAD_END
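
For comparison, the same padding rule can be written in a few lines of Python. This is a sketch for illustration only; the names lpad and rpad are chosen to mirror the functions above and are not taken from any library.

```python
def lpad(s, n, c=" "):
    """Pad s on the left with c to width n; s is returned as is
    when it is already n or more characters long."""
    return s if len(s) >= n else c * (n - len(s)) + s


def rpad(s, n, c=" "):
    """Same rule, padding on the right."""
    return s if len(s) >= n else s + c * (n - len(s))


# Two fields of 60 columns each fill a 120-character print line.
line = rpad("CHAPTER 3", 60) + lpad("PAGE 41", 60)
print(len(line))          # 120
print(lpad("7", 3, "0"))  # 007
```

Python's built-in str.rjust and str.ljust follow the same rule, which gives a convenient cross-check.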

Program 3.4  COUNT

COUNT(S1,S2) will count the number of occurrences of string S2
in S1. Overlapping occurrences of S2 are counted as separate
occurrences. Thus COUNT('MISSISSIPPI','SI') returns 2, and
COUNT('AAA','AA') also returns 2. If a substring is not found
the function effectively returns a zero (actually the null
string).


*  COUNT(S1,S2) counts the number of occurrences of string
*  S2 in string S1.
          DEFINE('COUNT(S1,S2)FIRST,REST,P')       :(COUNT_END)
*  Entry point: Set up pattern P to scan S1.  P makes rapid
*  scan for first character of S2 and then checks to see if
*  S2 matches.
COUNT     S2  LEN(1) . FIRST  REM . REST           :F(RETURN)
          P  =  POS(0)  BREAKX(FIRST)  S2
*  Find and remove all characters up to an occurrence of S2.
*  If found put all but first character of S2 back onto S1.
COUNT_1   S1  P  =  REST                           :F(RETURN)
          COUNT  =  COUNT + 1                      :(COUNT_1)
COUNT_END

Names referenced    Name      Type        Where defined
by COUNT:           BREAKX    Function    Program 8.2

Epilogue


The simple-minded approach to this problem is to simply scan 
the string S1 for an occurrence of the string S2, removing all 
that precedes the substring and repeating the process until no 
more occurrences are found. A faster technique (used here) is 
to use the high speed operation of the BREAK function which 
scans across a string at machine speeds looking for one of a 
class of characters. If successful, then and only then is the 
entire word (S2) matched. To employ BREAK in this way it is 
convenient to use BREAKX which is defined in Program 8.2 
(BREAKX is a built-in function in SPITBOL but not available in 
SNOBOL4). BREAKX, unlike BREAK, has implicit alternatives. 
If a pattern to its right (its subsequent) fails, it will try 
again, picking up one character to the right of where it left 
off. 
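
The counting rule itself (overlapping occurrences count separately) can be checked with a short Python model. It is a sketch only: it leans on str.find rather than on BREAKX, and the function name count is an assumption of the example.

```python
def count(s1, s2):
    """Count occurrences of s2 in s1, overlapping ones included."""
    if not s2:
        return 0
    n = 0
    pos = s1.find(s2)
    while pos != -1:
        n += 1
        # Resume one character past the start of this match.  COUNT
        # gets the same effect by putting REST back onto S1.
        pos = s1.find(s2, pos + 1)
    return n


print(count("MISSISSIPPI", "SI"))   # 2
print(count("AAA", "AA"))           # 2
print("AAA".count("AA"))            # 1: str.count skips overlaps
```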


Program 3.5  ROTATER

ROTATER(S,N) will rotate the string S right by N characters.
If N is negative the rotation will be to the left. Thus
ROTATER('ABCD',1) will return 'DABC'.


*  ROTATER(S,N) will rotate the string S right by N
*  characters.  If N is negative, S will be rotated to the left.
          DEFINE('ROTATER(S,N)S1')                 :(ROTATER_END)
*  Entry point: If S is null, return.
ROTATER   IDENT(S)                                 :S(RETURN)
*  Reduce number of positions to be rotated modulo SIZE(S).
*  Note REMDR preserves the sign of N.  If N is negative, use
*  complement.
          N  =  REMDR(N, SIZE(S))
          N  =  LT(N,0)  SIZE(S) + N
*  Perform the rotation and return.
          S  RTAB(N) . S  REM . S1  =  S1 S
          ROTATER  =  S                            :(RETURN)
ROTATER_END
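
The reduction step is easy to get wrong when moving between languages, because remainder operators disagree about signs. The Python sketch below models it under the stated assumption that REMDR keeps the sign of its first argument; the helper name remdr is introduced only for this illustration.

```python
def remdr(a, b):
    """Remainder that keeps the sign of a (the behaviour the
    comment in ROTATER attributes to REMDR)."""
    r = abs(a) % abs(b)
    return -r if a < 0 else r


def rotater(s, n):
    """Rotate s right by n characters; negative n rotates left."""
    if not s:
        return s
    n = remdr(n, len(s))
    if n < 0:
        # A left rotation by k is a right rotation by len(s) - k.
        n = len(s) + n
    cut = len(s) - n
    return s[cut:] + s[:cut]


print(rotater("ABCD", 1))    # DABC
print(rotater("ABCD", -1))   # BCDA
print(remdr(-7, 4), -7 % 4)  # -3 1: Python's % takes the divisor's sign
```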


Program 3.6  REVERSE

(available in SPITBOL and SITBOL) REVERSE(S) will return S
with its characters reversed. Thus REVERSE('SERUTAN') will
return 'NATURES'. One use of REVERSE is to effectively reverse
the order of pattern matching. For example, if one wishes to
replace the last occurrence of the substring SS in the string
S with the string R one can write:


S = REVERSE(S) 
S REVERSE(SS) = REVERSE(R) 
S = REVERSE(S) 


*  REVERSE(S) will reverse the sequence of characters in the
*  string S and return the result.
          DEFINE('REVERSE(S)A1,A2,L')
*  Initialize REV_ALPHA to hold the reversed alphabet.
          TEMP  =  &ALPHABET
REV_1     TEMP  LEN(1) . T  =                      :F(REVERSE_END)
          REV_ALPHA  =  T  REV_ALPHA               :(REV_1)
*  Entry point: For oversize strings go to REVERSE_1.  Also
*  ignore null strings.
REVERSE   L  =  SIZE(S)
          GT(L,256)                                :S(REVERSE_1)
          LE(L,0)                                  :S(RETURN)
*  Take the first L characters of &ALPHABET and the last L
*  characters of the reversed alphabet and issue a REPLACE.
          &ALPHABET  TAB(*L) . A1
          REV_ALPHA  RTAB(*L)  REM . A2
          REVERSE  =  REPLACE(A2,A1,S)             :(RETURN)
*  Divide and Conquer.
REVERSE_1 S  LEN(256) . A1  REM . A2
          REVERSE  =  REVERSE(A2)  REVERSE(A1)     :(RETURN)
REVERSE_END

Epilogue


The method used to perform the reversal follows a suggestion 
by Morris Siegel. It transforms a string, not by setting up 
the last 2 arguments of REPLACE and effecting a translitera- 
tion, but by setting up the first 2 arguments to accomplish a 
rearrangement. We will elaborate on this before continuing to 
the next function. 


String Transformations

A string transformation is any function which accepts a string
as argument and returns a string as value. As a humble
example, TRIM(S) is a transformation which produces a string
without trailing blanks. Special kinds of transformations
exist which are either interesting in their own right or can
be programmed to run very rapidly.


A homomorphism is a transformation T such that

          T(S1 S2) = T(S1) T(S2)                         (3.1)

That is, the transformation of the concatenation is equal to
the concatenation of the transformations. Said another way,
the transformation is context free. Since any string S can
ultimately be decomposed into characters, c1 c2 ... cn, we
have

          T(S) = T(c1) T(c2) ... T(cn)                   (3.2)

And from this last equation we can see that a homomorphism is
determined by its effect on individual characters. Let a1, a2,
..., an be a list of all the characters of the alphabet. Then
the set of strings {T(a1), T(a2), ..., T(an)} identifies
completely and unambiguously the transformation T.


A transliteration is an important special case of a
homomorphism in that each of the strings {T(a1), T(a2), ...,
T(an)} is a character. If T is a transliteration then T can be
programmed in SNOBOL4 as:

          T(S) = REPLACE(S, &ALPHABET, T(&ALPHABET))     (3.3)

In this way any transliteration can be programmed to run very
swiftly merely by obtaining the transliteration of &ALPHABET.
We have seen a number of examples of transliterations.
Programs UPLO (2.1), BCD_EBCDIC (2.2) and HEX (2.6) all make
use of REPLACE to perform the transliteration.


Consider the following statement

          S = REPLACE(S, S1, S2)                         (3.4)

Here S1 and S2 are two equi-length strings which describe a
transliteration on the string S. In fact, only those
characters which appear in S1 undergo a change. If we subject
&ALPHABET to such a transliteration to obtain

          TT = REPLACE(&ALPHABET, S1, S2)                (3.5)

we can use the result to effect the same transliteration on S
as in (3.4).

          S = REPLACE(S, &ALPHABET, TT)                  (3.6)
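
The equivalence of (3.4) and (3.6) can be tried out with a small Python model of REPLACE. The model is an assumption of this sketch: it is built on str.translate, uses a 256-character stand-in for &ALPHABET, and the lower-to-upper example table is chosen only for illustration.

```python
ALPHABET = "".join(chr(i) for i in range(256))   # stand-in for &ALPHABET


def replace(s, s1, s2):
    """Model of REPLACE(S,S1,S2): every character of s that occurs
    in s1 is replaced by the character at the same position in s2."""
    if len(s1) != len(s2):
        raise ValueError("s1 and s2 must be of equal length")
    return s.translate(str.maketrans(s1, s2))


s = "a cab, a back seat"

# (3.4): transliterate S directly.
direct = replace(s, "abc", "ABC")

# (3.5): transliterate the alphabet once to obtain the table TT ...
tt = replace(ALPHABET, "abc", "ABC")
# (3.6): ... and reuse TT on any string.
via_table = replace(s, ALPHABET, tt)

print(direct)               # A CAB, A BACk seAt
print(direct == via_table)  # True
```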


A k-transformation is a string transformation that operates
only on strings of length k and is undefined for strings of
other length. (Its domain is said to consist of the strings
of length k.) For example, the permutation (1 3 2) which
rearranges the 2nd and 3rd characters of a string of length 3
is a 3-transformation since it only applies to strings of
length 3.


A positional transformation is a k-transformation in which the 
output is some rearrangement of the characters of the input 
string with the properties that 1) characters in some posi- 
tions of the input string may be dropped, while others may 
appear several times and 2) constant characters may be added 
into some fixed positions of the output string. But in any 
case the disposition of a character depends on its position 
and not its value. More formally, the positional transforma- 
tion on strings of length k can be described as: 


          t1 c[i1] t2 c[i2] ... tn c[in] t[n+1]

where c[i] denotes the ith character of the input string, t1,
t2, ... are constant strings depending only on the
transformation and i1, i2, ..., in are constant integers
chosen from the set {1, 2, ..., k}.


An example of a positional transformation is depicted
graphically in Figure 3.1. It transforms a restricted class
of English words into the corresponding 'pig Latin'. Thus DIG
becomes IGDAY, DOG becomes OGDAY and CAT becomes ATCAY. In
general, it permutes a 3-character string and appends an 'AY'.


Another example of a positional transformation, one chosen 
from a more practical point of view, is the translation from 
ASCII to EBCDIC (see [IBM360a], App. F and [ASCII]). This 
transformation is indicated graphically in Figure 3.2. It, 
for example, transforms the ASCII code 1010101 to 10110101. 


A call to the replace function REPLACE(S1,S2,S3) is said to
be well-defined if S2 is as long as S3. If repeated characters
exist in S2, the last appearance of each character will
indicate the mapping. In this latter case the operation of the
function would not be ambiguous although the programmer's
motives might be.

Figure 3.1  A positional transformation that translates
three-character words into their pig-latin equivalent.


As we have described earlier, every transformation T defined
as

          T(S) = REPLACE(S, S1, S2)

is a transliteration provided the operation is well-defined.
Also, as has been previously noted, any transliteration T can
be written as REPLACE(S,S1,S2) for some S1, S2. Hence the set
of all transliterations is identical with the set of all
REPLACE's with given 2nd and 3rd arguments.


In a considerably less obvious way, the positional
transformations can also be implemented by the REPLACE
function.

For any strings S1, S2, the transformation defined as

          T(S) = REPLACE(S1, S2, S)

is a positional k-transformation on S where k is the size of
S2.

Conversely, any positional transformation satisfying certain
size constraints can be written as a REPLACE. Let P(S) be a
positional k-transformation. Let S1 be a string composed of k
different characters none of which are included in the
constant characters of the mapping. Then we can express P as

          P(S) = REPLACE(P(S1), S1, S)

Figure 3.2  A positional transformation for converting ASCII
to EBCDIC.


Like the transliterations, we need only obtain the positional
transformation for one model string to set up a high speed
program for transforming all strings in the domain.

As an example, the transformation indicated in Figure 3.1 can
be expressed as

          REPLACE('OGDAY', 'DOG', S)

As another example the transformation indicated in Figure 3.2
can be expressed as

          REPLACE('12134567', '1234567', S)


The characters in the model string must all be different from
any constant characters added to the string. Moreover, the
characters in the model string must all be different from each
other except that characters corresponding to positions which
are dropped may be duplicates of other characters which follow
them. Thus

          REPLACE('XY', 'XYYYY', S)

will extract the first and last characters from S provided S
is 5 characters long. Therefore, the size constraints imposed
by the REPLACE function are that the total number of
characters in the second argument (i.e. k) plus the number of
different constant characters added in the mapping minus the
positions ignored plus 1 if the last position is ignored
should not exceed the size of &ALPHABET.


A permutation of a string is simply a rearrangement of its 
characters and clearly this is a special case of a positional 
transformation. String reversal, of a constant length string, 
is a permutation and hence can be accomplished by using 
REPLACE with suitable 1st and 2nd arguments. But string- 
reversal of arbitrary length strings represents a class of 
permutations and for this reason REVERSE must prepare ap- 
propriate 1st and 2nd arguments depending on the particular k- 
transformation it must deal with. But this preparation is 
rapidly accomplished by a simple fixed-length pattern matching 
operation. 
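
The template idea is easiest to appreciate by running it. The sketch below models REPLACE in Python with str.translate (an assumption of this illustration, as are all the function names) and then applies it with the string to be transformed in the third argument position, as in the examples above.

```python
ALPHABET = "".join(chr(i) for i in range(256))   # stand-in for &ALPHABET


def replace(s, s1, s2):
    """Model of REPLACE(S,S1,S2).  When a character is repeated in
    s1, its last appearance decides the mapping."""
    if len(s1) != len(s2):
        raise ValueError("s1 and s2 must be of equal length")
    return s.translate(str.maketrans(s1, s2))


def pig_latin(word):
    # Template 'OGDAY' over the model 'DOG' (Figure 3.1).
    return replace("OGDAY", "DOG", word)


def ascii_to_ebcdic_bits(bits):
    # Seven bits in, eight out; bit 1 is used twice (Figure 3.2).
    return replace("12134567", "1234567", bits)


def first_and_last(s):
    # Positions 2-4 are dropped, so their model characters may repeat.
    return replace("XY", "XYYYY", s)


def reverse(s):
    # Model: the first len(s) characters of the alphabet.
    # Template: the same model written backwards.
    model = ALPHABET[:len(s)]
    return replace(model[::-1], model, s)


print(pig_latin("CAT"))                 # ATCAY
print(ascii_to_ebcdic_bits("1010101"))  # 10110101
print(first_and_last("HELLO"))          # HO
print(reverse("SERUTAN"))               # NATURES
```

In each case the input supplies only the mapping; the shape of the output is fixed by the template, which is why position and not value decides where a character ends up.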


Program 3.7  BLEND

BLEND(X,Y) will merge the two strings X and Y taking the first
character from X, the 2nd from Y, the 3rd from X, etc. Thus
BLEND('ABC','123') equals 'A1B2C3'. BLEND has been used
previously by the HEX function (Program 2.6) and is an example
of a class of positional transformations which can be
programmed to run quite rapidly. The 2 strings X and Y are
either the same length or X is one character longer than Y.
Thus BLEND('CHAPTER', DUPL(' ',6)) will return
'C H A P T E R'. BLENDs of strings not satisfying these
constraints are undefined.


*  BLEND(S1,S2) will blend the two (equi-length) strings S1
*  and S2 such that every other character is taken from each
*  string.  Thus BLEND('ABC','123') will return 'A1B2C3'.
          DEFINE('BLEND(S1,S2)T1,T2,ABC,XYZ,L1,L2')
*  Prepare in BLENDED_ALPHABET a blend of the lower and upper
*  halves of &ALPHABET.
          &ALPHABET  LEN(128) . ABC  LEN(128) . XYZ
BLE_1     ABC  LEN(1) . T1  =                      :F(BLEND_END)
          XYZ  LEN(1) . T2  =
          BLENDED_ALPHABET  =  BLENDED_ALPHABET T1 T2
+                                                  :(BLE_1)
*  Entry point: If S1 is too large, subdivide and recurse.
BLEND     L1  =  SIZE(S1)
          GT(L1,128)                               :F(BLEND_1)
          EQ(L1,0)                                 :S(RETURN)
          S1  LEN(128) . S1  REM . T1
          S2  LEN(128) . S2  REM . T2
          BLEND  =  REPLACE(BLENDED_ALPHABET, &ALPHABET, S1 S2)
+            BLEND(T1,T2)                          :(RETURN)
*  Otherwise prepare AXBYCZ to be a BLEND of ABC and XYZ and
*  to be as long as the string to be returned.  These strings
*  serve as a template for a positional transformation of the
*  combined string S1 S2.
BLEND_1   L2  =  SIZE(S2)
          &ALPHABET  LEN(*L1) . ABC  TAB(128)  LEN(*L2) . XYZ
          BLENDED_ALPHABET  LEN(*(L1 + L2)) . AXBYCZ
          BLEND  =  REPLACE(AXBYCZ, ABC XYZ, S1 S2)
+                                                  :(RETURN)
BLEND_END

Epilogue


The initialization section of BLEND prepares a string
BLENDED_ALPHABET which thereafter is used to obtain templates
for a positional transformation. For very large strings BLEND
is called recursively. As in REVERSE, this is done because of
limitations in the size of &ALPHABET rather than due to any
difficulties or limitations in handling long strings in
SNOBOL4. A slightly faster version of BLEND can be achieved
by nonrecursive methods but it seems hardly worth it.
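
The template construction can be mirrored in Python for the short-string case. This is a sketch under stated assumptions: a 256-character stand-in for &ALPHABET, str.translate in place of REPLACE, and no recursion for strings longer than 128 characters.

```python
ALPHABET = "".join(chr(i) for i in range(256))   # stand-in for &ALPHABET
LOW, HIGH = ALPHABET[:128], ALPHABET[128:]
# Counterpart of BLENDED_ALPHABET: the two halves interleaved.
BLENDED = "".join(a + b for a, b in zip(LOW, HIGH))


def blend(s1, s2):
    """Interleave s1 and s2 via a positional template."""
    l1, l2 = len(s1), len(s2)
    if l1 > 128 or l1 - l2 not in (0, 1):
        raise ValueError("outside the range this sketch handles")
    model = LOW[:l1] + HIGH[:l2]      # ABC XYZ
    template = BLENDED[:l1 + l2]      # AXBYCZ
    # The model characters are mapped onto S1 S2; the template then
    # says in which order they are to appear.
    return template.translate(str.maketrans(model, s1 + s2))


print(blend("ABC", "123"))        # A1B2C3
print(blend("CHAPTER", " " * 6))  # C H A P T E R
```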


Program 3.8  BALREV

BALREV(S) will return the balanced reversal of the string S.
That is, the characters of S are reversed and the parentheses
are interchanged. For example, BALREV('F(X)') is '(X)F' rather
than ')X(F' as would be returned by REVERSE. BALREV can be
used to reverse the order of scanning in an environment in
which BAL plays a role in the pattern matching. For example

          S  '('  BAL . E  ')'

will find the first parenthesized expression in S, whereas

          BALREV(S)  '('  BAL . E  ')'
          E  =  BALREV(E)

will set E to be the last parenthesized expression in S.

*  BALREV(S) will return the balanced reversal of S.
          DEFINE('BALREV(S)')                      :(BALREV_END)
BALREV    BALREV  =  REPLACE(REVERSE(S), ')(', '()')
+                                                  :(RETURN)
BALREV_END

Names referenced    Name       Type        Where defined
by BALREV:          REVERSE    Function    Program 3.6

Epilogue


BALREV is not of interest because it offers a challenge to 
one's program-writing abilities but rather because of the 
general notion of balanced reversal that it introduces and the 
fact that we will have occasion to make use of the function in 
later chapters. It is also of interest in that it provides in 
one line of code not only a useful function but one which uses 
both a transliteration and a positional transformation. 
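
The two steps, a reversal followed by a transliteration that swaps the parentheses, translate directly into Python. The fragment is an illustrative sketch; slicing stands in for REVERSE and str.translate for REPLACE.

```python
SWAP_PARENS = str.maketrans("()", ")(")


def balrev(s):
    """Reverse s, then interchange '(' and ')' so that the
    result is again balanced."""
    return s[::-1].translate(SWAP_PARENS)


print(balrev("F(X)"))          # (X)F
print(balrev("A(B(C))D"))      # D((C)B)A
print(balrev(balrev("F(X)")))  # F(X): applying it twice restores S
```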


Program 3.9  SUBSTR

(available in SPITBOL and SITBOL) SUBSTR(S,I,L) will return a
substring of the string S beginning at character I and
extending for L characters. If such a string is not properly
included in S then SUBSTR fails. The SUBSTR function was
patterned after the function by the same name in PL/I.
Although the taking of a substring is a capability implicit in
the pattern-matching facilities of SNOBOL4, its availability
as a function offers another dimension to this most
fundamental of string operations.


*  SUBSTR(S,I,L) returns a substring of length L beginning at
*  the Ith character of S.
          DEFINE('SUBSTR(S,I,L)')                  :(SUBSTR_END)
SUBSTR    S  LEN(*(I - 1))  LEN(*L) . SUBSTR       :S(RETURN)F(FRETURN)
SUBSTR_END
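
A point worth noting is that SUBSTR fails rather than truncating when the requested piece does not fit. The Python sketch below imitates that; returning None to stand for failure is a convention of this example only.

```python
def substr(s, i, l):
    """Return l characters of s starting at the i-th (1-based),
    or None when that piece is not wholly inside s."""
    start = i - 1
    if start < 0 or l < 0 or start + l > len(s):
        return None          # stands in for SNOBOL4 failure
    return s[start:start + l]


print(substr("SNOBOL4", 4, 3))   # BOL
print(substr("SNOBOL4", 6, 5))   # None
print("SNOBOL4"[5:10])           # L4: a bare slice truncates silently
```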


Program 3.10  DIFF

We may regard a string as a set of characters if we ignore
duplicates and their ordering. The fundamental set operations
are union, intersection and complementation. String
concatenation gives us union. Intersection can be obtained
from union if we also have complementation. Complementation
can be obtained if we have the universe string (set of all
characters) and set difference. &ALPHABET serves as the
universe and DIFF(S1,S2) will return the set difference,
S1 - S2. That is, DIFF(S1,S2) returns a string containing all
those characters that are in S1 and not S2.


          DEFINE('DIFF(S1,S2)')                    :(DIFF_END)
*  Entry point: set DIFF to S1 and then remove any
*  consecutive string of S2 characters.
DIFF      DIFF  =  S1
          IDENT(S2, NULL)                          :S(RETURN)
          S2  =  SPAN(S2)
DIFF_1    DIFF  S2  =                              :S(DIFF_1)F(RETURN)
DIFF_END
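
The chain of reasoning above (difference gives complement, and complement with union gives intersection) can be followed in a short Python sketch. The 256-character universe and the helper names complement and intersect are assumptions of the example.

```python
ALPHABET = "".join(chr(i) for i in range(256))   # the universe string


def diff(s1, s2):
    """Characters of s1 that do not occur in s2."""
    return "".join(c for c in s1 if c not in s2)


def complement(s):
    return diff(ALPHABET, s)


def intersect(s1, s2):
    # De Morgan: the intersection is the complement of the union
    # (concatenation) of the two complements.
    return complement(complement(s1) + complement(s2))


print(diff("MISSISSIPPI", "SP"))   # MIIII
print(intersect("ABCD", "BDX"))    # BD
```

Note that the result of intersect comes out in alphabet order, since it is carved out of the universe string.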

Program 3.11  SKIM

SKIM(S) 'skims off' the first appearance of each different
character of S and returns the result. Thus
SKIM('MISSISSIPPI') returns 'MISP'.

          DEFINE('SKIM(S)C')                       :(SKIM_END)
*  Entry point: Remove character from S and if not already
*  in SKIM, put it there and repeat.
SKIM      S  LEN(1) . C  =                         :F(RETURN)
          SKIM  C                                  :S(SKIM_D)
          SKIM  =  SKIM C                          :(SKIM)
*  But if C was found in SKIM, it may be prudent to remove
*  all characters already SKIM'ed from S.
SKIM_D    S  =  DIFF(S, SKIM)                      :(SKIM)
SKIM_END

Names referenced    Name    Type        Where defined
by SKIM:            DIFF    Function    Program 3.10

Epilogue


SKIM is slightly more complicated than it has to be. The line 
at SKIM_D is not strictly necessary and the statement that 
branches to SKIM_D could as well branch to SKIM. But for ef- 
ficiency purposes it is better to remove already-skimmed 
characters in the wholesale manner of DIFF rather than pain- 
fully, one at a time. The technique used in SKIM is to call 
DIFF whenever an old character is found. This will be an im- 
provement even if it takes relatively long to call DIFF. If 
the ratio of times of calling DIFF vs. going through the loop 
is 5, then it will pay if as few as 5 characters are removed 
from DIFF. It is possible, however, that the calls to DIFF 
are too frequent. It may be better to call DIFF only when,
say, 2 characters in a row have already been found.
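
The control flow of SKIM, including the wholesale removal at SKIM_D, can be followed in this Python sketch. It is an illustration only; a generator expression plays the part of DIFF.

```python
def skim(s):
    """Keep the first appearance of each different character."""
    out = ""
    while s:
        c, s = s[0], s[1:]
        if c in out:
            # An old character turned up: strip everything already
            # skimmed from the rest of s in one sweep (SKIM_D).
            s = "".join(ch for ch in s if ch not in out)
        else:
            out += c
    return out


print(skim("MISSISSIPPI"))   # MISP
```

Dropping the sweep and simply continuing the loop would return the same string; only the amount of work differs.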

Program 3.12  LEXGT

There exists a built-in function in SNOBOL4 called LGT.
LGT(S1,S2) is a predicate which will succeed if string S1 is
lexically greater than S2 and fail otherwise. The
determination of lexical ordering is based on &ALPHABET which
is machine dependent and may not represent the desired
ordering. In particular the lower case alphabet appears
separate from the upper case alphabet so that all upper case
letters are regarded as greater than all lower case letters.
Thus, 'Arabic' is considered greater than 'zebra'. The
function LEXGT which we define below will differ from LGT in
that the lexical ordering will not be based on &ALPHABET but
on a user-supplied transliteration table: LEX_TT.


*  LEXGT(S1,S2) is a predicate to determine whether S1 is
*  lexically greater than S2 according to a user-supplied
*  transliteration table in LEX_TT.
          DEFINE('LEXGT(S1,S2)')
*  As an example, we will initialize LEX_TT to a value such
*  that upper and lower case letters of the same letter will
*  be regarded as being adjacent.  Also letters will compare
*  lower than anything else.  First form, in ALPHA, the new
*  alphabetic ordering.
          ALPHA  =  BLEND(LOWERS_,UPPERS_)
+            DIFF(&ALPHABET, LOWERS_ UPPERS_)
*  Now transform this string to form a transliteration table.
          LEX_TT  =  REPLACE(&ALPHABET, ALPHA, &ALPHABET)
+                                                  :(LEXGT_END)
*  Entry point: translate and compare.
LEXGT     LGT( REPLACE(S1, &ALPHABET, LEX_TT),
+              REPLACE(S2, &ALPHABET, LEX_TT) )
+                                                  :S(RETURN)F(FRETURN)
LEXGT_END

Names referenced    Name       Type        Where defined
by LEXGT:           BLEND   *  Function    Program 3.7
                    UPPERS_ *  String      Program 2.1
                    LOWERS_ *  String      Program 2.1
                    DIFF    *  Function    Program 3.10

* indicates name is referenced in the initialization section.

Epilogue


We have effectively modified LGT by modifying its arguments.
In many problems this could be carried one step further for
greater efficiency. Assume that all the data that would ever
appear for comparison purposes is coming from the normal input
stream (under INPUT). We could convert characters as they were
being read in via a statement such as

          L  =  REPLACE(INPUT, &ALPHABET, LEX_TT)

But were we to do this we must be careful in using pattern
matching so that all character strings used to specify
patterns were also mapped in the same way. Thus to match the
line L for 'CAT' we would have to write:

          L  REPLACE('CAT', &ALPHABET, LEX_TT)


Program 3.13  AGT

One might suspect that LEXGT provides maximum flexibility in
the comparison of strings, since one may supply one's own
alphabet. But it does not handle the important case in which
certain distinct characters are to be regarded as identical
for comparison purposes. In particular, the lower case 'a' and
upper case 'A' are normally regarded as equal for dictionary
purposes. LEXGT would sort the words 'able,Afghan,artist' as
'able,artist,Afghan', which is not the dictionary ordering.
AGT(S1,S2) will compare 2 strings and return success if S1 is
alphabetically greater than S2. AGT is blind to the distinction
between upper and lower case. Otherwise it accepts the ordering
implied by &ALPHABET.


*  AGT(S1,S2) is a predicate to determine if S1 is
*  alphabetically greater than S2. Upper and lower case
*  versions of the same letter are regarded as equal.
*
        DEFINE('AGT(S1,S2)')
        AGT_TT = REPLACE(&ALPHABET, UPPERS_, LOWERS_)   :(AGT_END)
AGT     LGT( REPLACE(S1, &ALPHABET, AGT_TT),
+            REPLACE(S2, &ALPHABET, AGT_TT))
+                                         :S(RETURN)F(FRETURN)
AGT_END

Names referenced     Name         Type        Where defined
by AGT:              UPPERS_ *    String      Program 2.1
                     LOWERS_ *    String      Program 2.1


* indicates name is referenced in the initialization section. 
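
A reduced form of AGT can be run without Program 2.1 by folding the case of the arguments directly. This is a sketch only: the name AGT2 and the strings UP and LO are ours, and no table over the whole of &ALPHABET is built.

```snobol4
*  Sketch only.  AGT2, UP and LO are hypothetical names.
        UP = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
        LO = 'abcdefghijklmnopqrstuvwxyz'
        DEFINE('AGT2(S1,S2)')                      :(AGT2_END)
*  Fold both arguments to lower case, then compare.
AGT2    LGT( REPLACE(S1, UP, LO) ,
+            REPLACE(S2, UP, LO) )
+                                         :S(RETURN)F(FRETURN)
AGT2_END
*  Each line is printed only if the predicate succeeds.
        OUTPUT = AGT2('artist','Afghan') 'artist follows Afghan'
        OUTPUT = AGT2('Afghan','able') 'Afghan follows able'
END
```

Both lines should be printed, giving the dictionary ordering able, Afghan, artist that LEXGT could not produce.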


Epilogue


AGT and LEXGT provide 2 distinct means whereby one may alter 
the effective behaviour of LGT. If necessary, these 2 methods 
may be combined into one suitably-designed call to REPLACE. 
We leave this as an exercise. 
Program 3.14  SWAP

SWAP(NAME1,NAME2) will swap the values of the named variables.
Thus, SWAP(.N,.M) will interchange the values of N and M.


        DEFINE('SWAP(SWAP_ARG1,SWAP_ARG2)')      :(SWAP_END)
SWAP    SWAP = $SWAP_ARG1
        $SWAP_ARG1 = $SWAP_ARG2
        $SWAP_ARG2 = SWAP
        SWAP =                                   :(RETURN)
SWAP_END

Epilogue

The names of the arguments to SWAP were deliberately chosen
to be strange so as to avoid collision with the outside world.
The variable SWAP is set to null before returning because
otherwise a value would be returned and it is conceivable that
in some cases this would not be desirable.
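
Because SWAP works through names, it is not restricted to simple variables. The sketch below repeats the definition so that it stands alone, and then swaps two array elements as well as two variables; the variables N, M and A are ours, chosen only for illustration.

```snobol4
        DEFINE('SWAP(SWAP_ARG1,SWAP_ARG2)')      :(SWAP_END)
SWAP    SWAP = $SWAP_ARG1
        $SWAP_ARG1 = $SWAP_ARG2
        $SWAP_ARG2 = SWAP
        SWAP =                                   :(RETURN)
SWAP_END
*  Swap two simple variables.
        N = 'first'
        M = 'second'
        SWAP(.N, .M)
        OUTPUT = N ' ' M
*  Swap two array elements by passing their names.
        A = ARRAY(2)
        A<1> = 'x'
        A<2> = 'y'
        SWAP(.A<1>, .A<2>)
        OUTPUT = A<1> A<2>
END
```

The first line printed should show the two words interchanged and the second should show the two letters interchanged.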


Program 3.15  REPL

REPL(S1,S2,S3) will do a string-by-string replacement (as
opposed to a character-by-character replacement à la REPLACE)
on the string S1. The string S1 is scanned for instances of
the string S2 and each is replaced by S3. Portions of S1
already scanned and the replaced string are not reexamined for
instances of S2.


        DEFINE('REPL(S1,S2,S3)C,T,FINDC')        :(REPL_END)
*
*  Entry point: Define pattern FINDC which will do a fast
*  scan for the initial character.
*
REPL    S2  LEN(1) . C  =                        :F(FRETURN)
        FINDC = BREAK(C) . T  LEN(1)
        S2 = POS(0) S2
*
*  Top of loop: First remove the prefix, T; then test for S2.
*
REPL_1  S1  FINDC  =                             :F(REPL_2)
        S1  S2  =                                :F(REPL_3)
        REPL = REPL T S3                         :(REPL_1)
REPL_3  REPL = REPL T C                          :(REPL_1)
*
*  Return point: The lead character, C, was not found in S1.
*
REPL_2  REPL = REPL S1                           :(RETURN)
REPL_END

Names referenced Name Type Where defined 
by REPL: BREAKX Function Program 8.2 


Epilogue


Like the function COUNT, the technique used to speed the
search is to do a fast scan (at BREAK speeds) for the initial
character. Other than this, the coding is straightforward but
surprisingly lengthy.


Program 3.16  QUOTE

QUOTE(S) will convert its argument to a string which will
resemble a SNOBOL4 expression which, when evaluated, will
yield the original string. In the simplest case QUOTE(S) will
place the string S between apostrophes. However, if S contains
apostrophes, QUOTE will enclose these within double quotes.
Thus


        OUTPUT = QUOTE("DON'T")

will print

        'DON' "'" 'T'


Note that EVAL(QUOTE(S)) is always equal to S. QUOTE is useful
when preparing code. An example is given in RSELECT (Prog.
16.7).


        DEFINE('QUOTE(S)S1,Q,QQ')                :(QUOTE_END)
*
*  Entry point: The only thing that gives us any trouble is
*  the single quote. If we find one we must wrap it in double
*  quotes and offset it with blanks.
*
QUOTE   Q = "'" ;  QQ = '"'
        QUOTE = Q REPL(S, Q, Q ' ' QQ Q QQ ' ' Q) Q   :(RETURN)
QUOTE_END

Names referenced     Name         Type        Where defined
by QUOTE:            REPL         Function    Program 3.15


EXERCISES

Exercise 3.1  Write RPAD in terms of LPAD and REVERSE.


Exercise 3.2  Write RPAD in terms of LPAD and ROTATER. Assume
that SIZE(S) < N.

Exercise 3.3  Write a function CENTER(S,N,C) for centering
objects within a field of width N.


Exercise 3.4  Use the REPLACE function and BLEND to rapidly
extract every other character from the string S, starting with
the first. (Assume that SIZE(S) is less than
2 * SIZE(&ALPHABET) and can be even or odd.) This can be done
in 2 statements.


Exercise 3.5  a) Determine S1 and S2 so that REPLACE(S1,S2,S)
realizes the positional transformation shown in Figure 3.3.

b) What is the fewest number of different characters needed
in S1 and S2?


Figure 3.3 (diagram of the positional transformation)

Exercise 3.6  a) Using REPLACE, obtain the last character of
string S.

b) In a similar way extract the Kth character.



Exercise 3.7  Some cyphers (called Transpositional) serve to
encode text by rearranging characters (see for example Smith
[1955]). The message is written in a rectangular matrix
horizontally from left to right. The encoding is obtained by
reading vertically. Thus, if the matrix is 2x6 and the message
is

        ATTACK
        ATDAWN

the encoding is

        AATTTDAACWKN

a) Write a function TPOS(S,H,W) to encode the string S. H is
the height and W is the width of the matrix and S is assumed
to be exactly H * W characters long.

b) Using TPOS, find S1 and S2 such that REPLACE(S1,S2,S) will
convert all strings of length H * W. (Assume that H * W does
not exceed SIZE(&ALPHABET).)

c) Using the scheme of b) write a function ENCODE which will
encode arbitrary length strings. Trailing characters are
ignored. Thus, if the matrix is 7x3 and the message is

        THEBRIT
        TSHAREC
        OMING

then the encoding is

        'TTOHSMEHIBANRRGIETC'

(Hint: assume some character exists, say colon (:), which will
never appear in the string to be encoded.)


Exercise 3.8  a) Extend BLEND(X,Y) so that if string X is n
times longer than string Y then the characters of Y will be
inserted at every (n+1)st position. Thus
BLEND('ABCDEF','123') will return 'AB1CD2EF3'. For efficiency
purposes, a table of templates may be stored for the
positional transformations.

b) How would the new BLEND be used in the encoding of TPOS
(see Exercise 3.7)?


Exercise 3.9  Assume a function OR(S1,S2) is available for
ORing the bits of the equi-length character strings S1 and S2
(at high speeds). Rewrite CH (Program 2.7) so that it performs
at high speed using the REPLACE function.


Exercise 3.10  E contains a string representing a Fortran
arithmetic expression which consists, possibly, of the sum or
difference of expressions E1 and E2. Keeping in mind that
Fortran associates operators from left to right, parse E,
assigning to E1 and E2 the proper values. If E is not of this
form go to label NOT.


Exercise 3.11  Design a 'worst-case' (time-wise) string
argument for SKIM that is 20 characters long.


Exercise 3.12  Any string may be said to denote a set of
characters, viz. the set of which it consists. Assuming that
the strings denoting sets may have duplicate characters, write
an expression to express the a) union and b) intersection of
2 sets S1 and S2. c) Write an expression to indicate the
negation of S. d) Write an expression which succeeds if set S1
equals set S2.


Exercise 3.13  Write an expression which will succeed if there
are no duplicate characters in the string S (you may use
functions defined in this chapter).


Exercise 3.14  Write an expression to obtain the set of
characters that occur exactly once in a string S.


Exercise 3.15  (a) Remove leading 0's from a string by means
of TRIM, REPLACE, and REVERSE. (b) Remove leading 0's from a
numeric string S (one capable of being converted to integer)
by means of a single operator.


Exercise 3.16  AGT and LEXGT represent 2 methods of
effectively modifying the lexical comparison. To generalize,
let the string ALPHA denote an alphabetic ordering as follows.
Sets of equal letters are enclosed in parentheses. Otherwise
the lowest to the highest character are ordered left to right.
Characters not in ALPHA may occur in any order. Thus

        ALPHA = '(Aa)(Bb)(Cc)(Dd)(Ee)...(Zz)0123456789'

would describe an ordering in which all the alphabetics appear
before the numerics and in which the alphabetics are grouped
in their normal order. (a) Write a program to convert a string
such as ALPHA into a pair of strings A1 and A2 such that

        LGT( REPLACE(S1,A1,A2) , REPLACE(S2,A1,A2) )

will compare strings S1 and S2.

(b) If parentheses themselves are to be included in the
characters to be explicitly ordered a difficulty arises.
Establish escape conventions for parentheses and modify your
conversion program accordingly.


Exercise 3.17  What 3 variables may not be swapped using
SWAP (Prog. 3.14)?


Exercise 3.18  Assume that input text, contained in the string
S, is a personalized message to some one or some organization.
Within S, and embedded within paired #'s, are SNOBOL4
expressions to be evaluated on an individual basis. The rest
of the text is constant for each message. This text may have
quotes embedded within it but not #'s. Compose, in Q, a
SNOBOL4 expression which when evaluated will yield the desired
string. For example if S is:

        DEAR MR. #NAME#:

then a correct translation is

        'DEAR MR. ' NAME ':'


Exercise 3.19  State which of the following are homomorphisms
(h) and which of the homomorphisms are also transliterations
(ht). (a) UPLO, (b) BCD_EBCDIC, (c) ROMAN, (d) HEX, (e) CH,
(f) QUOTE


Exercise 3.20  Some systems accept abbreviations of all
command names. For example, DEL, DE or even D would be
acceptable abbreviations for the DELETE command provided this
uniquely specified the command. Given a list of commands in
the string CMD such as:

        CMD = ',ALLOCATE,AUGMENT,BEGIN,CHANGE, ... '

write a function C(S) which will determine if a given string S
uniquely specifies a command. If it does, C should return the
command. If it does not, it should fail. Hint: using COUNT
(Prog. 3.4) the body of the routine can be written in one
statement.


Exercise 3.21  Assume that X and Y are string-valued. In one
statement, swap X and Y without using a third variable.


Exercise 3.22  What is the value of

        SIZE (QUOTE (QUOTE ("X*'))) ?

Chapter 4 - ARRAY FUNCTIONS


CONTENTS

        CRACK ............... 4.1
        STRINGOUT ........... 4.2
        SEQ ................. 4.3
        AOPA ................ 4.4
        FIND ................ 4.5
        AI .................. 4.6
        TRUNC ............... 4.7
        CATA ................ 4.8


While strings are convenient for representing input data and
for economizing on search time when scanning for patterns,
arrays are quite useful when it is necessary to randomly alter
selected portions of the interior of the structure. Arrays are
also convenient when dealing with sequences of things other
than characters, such as numbers, patterns, and strings
themselves.


To effectively use the array facility in SNOBOL4 it is impor- 
tant to have some conception as to how arrays are implemented. 
The 3 statements below allocate an array and assign values to 
its first 2 elements. Figure 4.1 indicates the data configura- 
tion after the statements are executed. 


        ALPHA = ARRAY(4)
        ALPHA<1> = 16
        ALPHA<2> = 'ABC'


         ALPHA
        (A, *)----+
                  |
                  v
                ////////////
          <1>   (I, 16)
          <2>   (S, *)-------> 'ABC'
          <3>   (S, 0)
          <4>   (S, 0)

                     Figure 4.1

The data configuration after an array allocation
and 2 element assignments.

The array is a data object of type ARRAY (denoted by A in the
datatype field of the descriptor in the variable ALPHA). The
data object has information (denoted by cross hatching) to
indicate its physical extent and upper and lower bounds. In
addition, for every array element, there is one descriptor.
Hence, each array element may be assigned a data object of any
datatype; also, the objects may be of mixed type as the
example illustrates. Thus, an array in SNOBOL4 is more properly
regarded as an array of variables rather than as an array of
data. The default value of array elements is the null string
denoted by (S,0) in the figure.


Since an array is a value, it may readily be passed from 
variable to variable. The data configuration resulting from 
the following statements is indicated in Figure 4.2. 


BETA = ALPHA 
BETA<1> = 3.7 


         ALPHA
        (A, *)----+
                  |
                  v
                ////////////
          <1>   (R, 3.7)
          <2>   (S, *)-------> 'ABC'
          <3>   (S, 0)
          <4>   (S, 0)
                  ^
         BETA     |
        (A, *)----+

                     Figure 4.2

The data configuration after an array assignment
(to BETA) and one element assignment.


The assignment to BETA is accomplished only by copying the
descriptor in ALPHA, not by copying the array. Thus, a 
reference to BETA<1> becomes also a reference to ALPHA<1>, so 
that modification of BETA<1> implies modification of ALPHA<1>. 
This sort of collision can be avoided by use of the COPY func- 
tion. Figure 4.3 illustrates the data configuration which 
results by executing the following 2 statements in place of 
the above 2. 


BETA = COPY (ALPHA) 
BETA<1> = 3.7 
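
The difference between assignment and COPY can be observed directly. The following is a minimal sketch; the variable GAMMA and the string values are ours, used so that the printed results do not depend on how a particular system prints reals.

```snobol4
        ALPHA = ARRAY(4)
        ALPHA<1> = 'original'
*  Assignment copies only the descriptor: BETA and ALPHA
*  now refer to one and the same array.
        BETA = ALPHA
        BETA<1> = 'changed'
        OUTPUT = 'After assignment ALPHA<1> is ' ALPHA<1>
*  COPY builds a second array: GAMMA may be altered
*  without disturbing ALPHA.
        ALPHA<1> = 'original'
        GAMMA = COPY(ALPHA)
        GAMMA<1> = 'changed'
        OUTPUT = 'After COPY ALPHA<1> is ' ALPHA<1>
END
```

The first line printed should report 'changed' and the second 'original'.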


The array elements are variables and hence may be assigned any 
data objects as value, including an array. For example 


ALPHA<2> = BETA 
will result in the data configuration shown in Figure 4.4. 


Compared with the rather rich string-handling facilities in 
SNOBOL4 there is a relative lack of such facility with respect 
to arrays. Arrays may be allocated; they may be assigned 
values and these values may later be examined; and the size of 
the array may be obtained via the PROTOTYPE function. But few 
operations are supported that deal with arrays as an entire 
entity. Arithmetic operators may not be applied to arrays. 
Arrays may not be scanned for patterns; they may not be trim- 
med, or concatenated or truncated other than as the programmer 
may provide these facilities himself.
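
The few facilities that do exist are easily exercised. The sketch below, with names of our own choosing, allocates an array of mixed elements and then inspects it with PROTOTYPE and DATATYPE. The exact spelling of the datatype names printed may vary from one implementation to another.

```snobol4
*  An array whose elements hold an integer, a string
*  and another array.
        A = ARRAY(3)
        A<1> = 16
        A<2> = 'ABC'
        A<3> = ARRAY('2,3')
*  PROTOTYPE returns the dimension information as a string.
        OUTPUT = 'Prototype of A:    ' PROTOTYPE(A)
        OUTPUT = 'Prototype of A<3>: ' PROTOTYPE(A<3>)
*  DATATYPE reports the type of each element separately.
        OUTPUT = DATATYPE(A<1>) ' ' DATATYPE(A<2>) ' ' DATATYPE(A<3>)
END
```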


But the way in which arrays have been implemented in SNOBOL4
does provide the basis for forming a more elaborate array-
processing facility. Because arrays are represented via a
pointer, they can readily be passed to and returned from
subroutines; the time-consuming overhead of copying arrays
across the boundaries of the call does not exist. Also, and
perhaps more importantly, the user need not specify the size
that the returned array is to be, nor need he specify the
nature (i.e. the datatype) of the array elements. Indeed, the
value returned may be scalar or array with the decision
depending on what happens at execution time. Array elements may
be mixed, some being string, some, integer and some, even
array. With many of the normal restrictions removed, the user
is free to concoct seemingly wild and fanciful operations upon
arrays, manipulating these data objects with a degree of
freedom that one normally associates only with strings. Several
examples of this sort of thing follow.


The use of descriptor notation can be cumbersome in dealing 
with an array of simple objects such as integers, reals or 
strings. Hence, where the meaning is otherwise clear, we will 
display an array of data objects in the simplified notation 
shown in Figure 4.5b. 
         ALPHA
        (A, *)--->  ////////////
              <1>   (I, 16)
              <2>   (S, *)--------+
              <3>   (S, 0)        |
              <4>   (S, 0)        v
                                'ABC'
         BETA                     ^
        (A, *)--->  ////////////  |
              <1>   (R, 3.7)      |
              <2>   (S, *)--------+
              <3>   (S, 0)
              <4>   (S, 0)

                     Figure 4.3

This figure illustrates the effect of the COPY
function as contrasted with assignment.


                ////////////
          <1>   (I, 16)
          <2>   (A, *)----+
          <3>   (S, 0)    |
          <4>   (S, 0)    |
                          v
                        ////////////
                  <1>   (R, 3.7)
                  <2>   (S, *)-------> 'ABC'
                  <3>   (S, 0)
                  <4>   (S, 0)

                     Figure 4.4

The result of executing ALPHA<2> = BETA.


                ////////////
          <1>   (S, *)-------> 'ABLE'        <1>   'ABLE'
          <2>   (S, *)-------> 'BAKER'       <2>   'BAKER'
          <3>   (R, 3.6)                     <3>   3.6
          <4>   (I, 16)                      <4>   16

                   (a)                          (b)

                     Figure 4.5

(a) shows the descriptor representation of an
array. (b) shows a simplified representation for
the same array.


Program 4.1  CRACK

CRACK(S,B) is used to 'crack' open the string S and assign its
contents to an array. This array is returned. B is a break
character which serves to separate items in the string. The
caller has the option of ending the string S with a break
character. If none exists, CRACK will append one before
further processing. Thus


        CRACK('ABLE BAKER CHARLIE',' ')


will return the array

        <1>   'ABLE'
        <2>   'BAKER'
        <3>   'CHARLIE'


If B is null, the individual characters are cracked apart.


*  CRACK(S,B) will convert from string to array breaking at
*  the character B.
*
        DEFINE('CRACK(S,B)I,PAT')                :(CRACK_END)
*
*  Entry point: If B is null branch off to CRACK_1.
*
CRACK   IDENT(B,NULL)                            :S(CRACK_1)
*
*  If S does not end with a break character append one.
*
        S  RTAB(1) B ABORT | REM . S  =  S B
*
*  Then prepare an array (CRACK) of appropriate size and
*  assign to the variable PAT a pattern to extract substrings
*  from S.
*
        CRACK = ARRAY(COUNT(S,B))
        PAT = BREAK(B) . *CRACK<I>  LEN(1)
*
*  Merge here from CRACK_1. Remove the strings and insert
*  them into CRACK. Return when S is exhausted.
*
CRACK_2 I = I + 1
        S  PAT  =                                :S(CRACK_2)F(RETURN)
*
*  If no break character, allocate CRACK and assign pattern to
*  PAT. This pattern will strip individual characters from S.
*
CRACK_1 CRACK = ARRAY(SIZE(S))
        PAT = LEN(1) . *CRACK<I>                 :(CRACK_2)
CRACK_END

Names referenced     Name         Type        Where defined
by CRACK:            COUNT        Function    Program 3.4

Program 4.2  STRINGOUT

STRINGOUT(A,SEP) will serve to convert from array to string.
SEP contains a separation string to be inserted between
strings of the array A. Thus if A is an array with values

        <1>   'CAT'
        <2>   'DOG'
        <3>   'MOUSE'


then STRINGOUT(A,',') will return 'CAT,DOG,MOUSE'. A is
assumed to be singly dimensioned with lower bound 1 and
composed of strings or items which can be concatenated. Note
that STRINGOUT( CRACK(S,B), B ) will return S provided that S
does not end in B. Note also that STRINGOUT( CRACK(S B,B), B )
will always return S.
*  STRINGOUT(A,SEP) will convert from an array of strings to
*  a string. SEP will serve to separate the strings.
*
        DEFINE('STRINGOUT(A,SEP)I')              :(STRINGOUT_END)
*
*  Entry point: Initialize I and STRINGOUT.
*
STRINGOUT  I = 1
        STRINGOUT = A<1>                         :F(RETURN)
*
*  Top of loop
*
STRINGOUT_1  I = I + 1
        STRINGOUT = STRINGOUT SEP A<I>
+                                     :S(STRINGOUT_1)F(RETURN)
STRINGOUT_END


Program 4.3  SEQ

Although it is not conceptually difficult to sequence through
an array, it can be a tedious exercise if it is required that
we do it over and over. This is especially true in SNOBOL4
which has no DO or FOR statement. SEQ(S,N) provides a
sequencing capability similar to the action of a DO-loop.
For example:


        SEQ(' A<I> = I ', .I)


will initialize an array A such that the Ith element is
assigned the value I. The first argument is a statement or
sequence of statements separated by semicolons. The second
argument is the name of a variable. The variable is assigned
the values 1,2,... and the statement or statements are
executed for each such assignment. This is repeated until
failure is detected on the last statement of the sequence.
Thus


        SEQ(" A<K> = TRIM(INPUT) ; DIFFER(A<K>,'STOP')", .K)


will read cards successively into the array A until either A
has no more room or the word 'STOP' is encountered on the
input stream. But note that if an end-of-file is encountered
(INPUT fails) the sequencing will not be stopped. In this
case, if no subsequent file exists, the program will terminate
in error.


If failure is detected on the first attempt to execute the
statements then SEQ will return failure. This permits
compounding the iteration as in the following:

        SEQ(" SEQ(' A<I,J> = I * J',.J)", .I)


The above statement will assign a value (as indicated) to each
element of a doubly dimensioned array A.
*  SEQ(S,N) will sequence through a set of statements until
*  failure is detected. The indexing variable is given by the
*  name N.
*
        DEFINE('SEQ(ARG_S,ARG_NAME)')            :(SEQ_END)
*
*  Entry point: Initialize indexing variable. Then convert
*  ARG_S to code.
*
SEQ     $ARG_NAME = 0
        ARG_S = CODE(ARG_S ' :S(SEQ_1) F(SEQ_2) ')
+                                                :F(ERROR)
*
*  Increment indexing variable by 1 and spring off to
*  compiled code. Return will be to SEQ_1 or SEQ_2.
*
SEQ_1   $ARG_NAME = $ARG_NAME + 1                :<ARG_S>
*
*  Control flows to SEQ_2 if a fail was detected. If first
*  time through fail; otherwise succeed.
*
SEQ_2   EQ($ARG_NAME,1)                          :S(FRETURN)F(RETURN)
SEQ_END
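
The mechanism underlying SEQ, namely compiling a string with CODE and entering it by a direct goto, can be seen in isolation in the following sketch. The labels NEXT, DONE and BAD and the variable BODY are ours; the loop body is fixed rather than passed as an argument.

```snobol4
        A = ARRAY(5)
        I = 0
*  Compile one statement.  Its goto field returns control
*  to labels in the main program.
        BODY = CODE(' A<I> = I * I :S(NEXT)F(DONE)')   :F(BAD)
*  Step the index, then jump into the compiled code.
NEXT    I = I + 1                                :<BODY>
*  A<6> is out of bounds, so the compiled statement
*  eventually fails and control arrives here.
DONE    OUTPUT = 'elements filled: ' (I - 1)
        OUTPUT = 'A<3> holds ' A<3>              :(END)
BAD     OUTPUT = 'CODE could not compile the string'
END
```

The run should report 5 elements filled and the value 9 in A<3>. As in SEQ, it is the failure of the array reference that ends the loop, so no explicit upper bound is needed.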

Program 4.4  AOPA

Some languages such as PL/I and APL permit arrays to be
arguments to arithmetic operators. SNOBOL4 does not permit
such operations, but functions can be written to serve the
same purpose. The resulting function will not be as convenient
as the built-in facility but it will be at least, if not more,
general and will be programmer-modifiable. AOPA(A1,OP,A2) will
return a new array whose elements are the result of applying
the indicated operation between corresponding elements of the
arrays A1 and A2. Both A1 and A2 are assumed to be singly
dimensioned of lower bound 1. Either A1 or A2 or both may be
scalar. OP is indicated by a string and can be any SNOBOL4
operator. Thus


        A = AOPA(A, '+', B)

will add the array A to B.

        C = AOPA(A, ' ', ',')

will concatenate a comma to every element of the array A.

*  AOPA(A1,OP,A2) will apply the infix operator OP to
*  corresponding pairs of A1 and A2. An array will be returned
*  unless both are scalars.
*
        DEFINE('AOPA(A1,OP,A2)S1,I,S2,S')        :(AOPA_END)
*
*  Entry point: First check datatypes. If neither is an
*  array we fall through the two tests, apply the OP to the
*  two scalars and return.
*
AOPA    IDENT(DATATYPE(A1), 'ARRAY')             :S(AOPA_1)
        IDENT(DATATYPE(A2), 'ARRAY')             :S(AOPA_2)
        AOPA = EVAL('A1 ' OP ' A2')              :(RETURN)
*
*  A1 is an array; A2 is in doubt.
*
AOPA_1  S1 = '<I>'
        S2 = IDENT(DATATYPE(A2), 'ARRAY') '<I>'
        AOPA = ARRAY(PROTOTYPE(A1))              :(AOPA_COMMON)
*
*  A2 is an array; A1 is not.
*
AOPA_2  S2 = '<I>'
        AOPA = ARRAY(PROTOTYPE(A2))
*
*  Common code
*
AOPA_COMMON
        S = ' AOPA<I> = A1' S1 ' ' OP ' A2' S2
        SEQ(S, .I)                               :(RETURN)
AOPA_END

Names referenced     Name         Type        Where defined
by AOPA:             SEQ          Function    Program 4.3
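
It may help to see the statement that AOPA hands to SEQ. The sketch below merely prints the string built at AOPA_COMMON for two of the cases, using the same concatenation as the program; nothing is compiled or executed.

```snobol4
        OP = '+'
*  Both arguments are arrays: each is subscripted.
        S1 = '<I>'
        S2 = '<I>'
        OUTPUT = ' AOPA<I> = A1' S1 ' ' OP ' A2' S2
*  A2 is scalar: S2 stays null and A2 is used as it stands.
        S2 =
        OUTPUT = ' AOPA<I> = A1' S1 ' ' OP ' A2' S2
END
```

The two lines printed are the statements `AOPA<I> = A1<I> + A2<I>` and `AOPA<I> = A1<I> + A2`, each with a leading blank so that CODE does not take the first name for a label.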
Program 4.5  FIND

FIND(A,PRED) will search an array for an extreme element. The
type of extreme element will be determined by the predicate
PRED. Thus


        FIND(A, 'GE')

will find and return the index of the largest element in the
array A. Specifically it will return the index of the first
element in A which is greater than or equal to all elements of
higher index.

        FIND(A, 'GT')

will also return the index of the largest element. If there
is a tie, FIND will return the index of the last such element.
Thus

        EQ( FIND(A,'GT') , FIND(A,'GE') )

may fail, but

        EQ( A< FIND(A,'GT') > , A< FIND(A,'GE') > )

will succeed.


The predicate may be prefixed with the '~' operator. Thus

        A< FIND(A, '~LGT') >

will return the string lowest in alphabetic order of the
strings of the array A.


*  FIND(A,PRED) will return the index of an extreme element
*  in the array A as determined by the predicate PRED.
*
        DEFINE('FIND(A,PRED)EX,I,MAX,TEST')      :(FIND_END)
*
*  Entry Point: Construct an expression for comparing 2
*  values. Also initialize FIND and MAX, tentatively.
*
FIND
        EX = CONVERT(PRED '(MAX,TEST)' , 'EXPRESSION')
        FIND = 1
        MAX = A<FIND>
*
*  Compare MAX with all elements of higher index than FIND
*  until failure is encountered. If no elements remain,
*  return.
*
FIND_1  I = I + 1
        TEST = A<I>                              :F(RETURN)
        EVAL(EX)                                 :S(FIND_1)
*
*  A new extreme element has been found.
*
        MAX = TEST
        FIND = I                                 :(FIND_1)
FIND_END
Epilogue 


Testing of the array is completed when a reference to A<I>
(second statement after FIND_1) fails (indicating array
reference out of bounds). Note that EX has been assigned an
expression to test MAX against TEST rather than to test MAX
against A<I>. The reader might argue that the latter strategy
is more efficient since it would save one instruction in the
inner loop. That is, failure of EVAL(EX), in this case, would
mean either failure of the predicate PRED or array reference
out of bounds and the distinction could be made afterwards.
But this scheme would not work because ~LGT(MAX,A<I>) actually
succeeds if the array reference A<I> is out of bounds. That
is to say the unary ~ operator does not merely negate the
predicate, it negates the entire expression. In any case, the
savings would not be very great. As we will see, assignments
and statement overhead cost little compared with anything else
in the language.
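
The behaviour of the negation operator on an out-of-bounds reference is easily confirmed. The following sketch assumes a system on which negation is written '~' (on some systems it is written with the not sign); the array and its values are ours.

```snobol4
        A = ARRAY(2)
        A<1> = 'B'
        A<2> = 'A'
*  A<3> is out of bounds, so the plain predicate fails.
        LGT('B', A<3>)                           :S(END)
        OUTPUT = 'plain predicate failed'
*  Negation applies to the whole expression, including the
*  failing array reference, and therefore succeeds.
        ~LGT('B', A<3>)                          :F(END)
        OUTPUT = 'negated predicate succeeded'
END
```

Both lines should be printed, which is why FIND must detect the end of the array in a separate statement.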
Program 4.6  AI

AI(A,I) (Apply Index), where A and I are arrays, will regard I
as a set of indices to be applied to the array A. The result
is an array. Thus if

              <1>   'CAT'                <1>   3
        A  =  <2>   'DOG'          I  =  <2>   2
              <3>   'CANARY'


the array returned is

              <1>   'CANARY'
              <2>   'DOG'


If I is a scalar the result will be A<I>.


*  AI(A,I) will apply the indices contained in I to the array
*  A.
*
         DEFINE('AI(A,I)J')                           :(AI_END)
*
*  Entry point: If I is not an array, go to AI_1 where we
*  merely return the Ith element.
*
AI       IDENT(DATATYPE(I),'ARRAY')                   :F(AI_1)
*
*  Make AI, the array to be returned, look like I. Then apply
*  the indices.
*
         AI = ARRAY(PROTOTYPE(I))
         SEQ(' AI<J> = A<I<J>> ', .J)                 :(RETURN)
AI_1     AI = A<I>                                    :(RETURN)
AI_END
Names referenced      Name      Type        Where defined
by AI:                SEQ       Function    Program 4.3


Program 4.7  TRUNC

TRUNC(A,L,H) will return the truncation of the
singly-dimensioned array A. That is, a new array will be
created and returned consisting of the elements A<L>, A<L+1>,
..., A<H>.


         DEFINE('TRUNC(A,L,H)I')                      :(TRUNC_END)
TRUNC    TRUNC = ARRAY(H - L + 1)
         L = L - 1
         SEQ(' TRUNC<I> = A<L + I> ', .I)             :(RETURN)
TRUNC_END
Names referenced      Name      Type        Where defined
by TRUNC:             SEQ       Function    Program 4.3


Program 4.8  CATA

CATA(A1,A2) will concatenate the two arrays A1 and A2. Both
are assumed singly-dimensioned of lower bound one. The
returned array also has lower bound one.


         DEFINE('CATA(A1,A2)I,N1')                    :(CATA_END)
CATA     N1 = PROTOTYPE(A1)
         CATA = ARRAY(N1 + PROTOTYPE(A2))
         SEQ(' CATA<I> = A1<I> ', .I)
         SEQ(' CATA<N1 + I> = A2<I> ', .I)            :(RETURN)
CATA_END
Names referenced      Name      Type        Where defined
by CATA:              SEQ       Function    Program 4.3


EXERCISES


Exercise 4.1

A common problem is to initialize an array with a large number
of strings. Commonly this is done with assignment statements
but if the list is long this technique can prove wearisome.
Using CRACK, assign an array of length 12 to the variable M
assigning to M<I> the name of the Ith month (or an acceptable
abbreviation). Thus M<1> = 'JAN.', etc.


Exercise 4.2

Modify SEQ so that it accepts 2 additional (optional)
arguments. The first will be a lower bound (if not present the
lower bound is taken to be 1) and the second will indicate the
increment (either positive or negative). The default increment
should, of course, be 1.


Exercise 4.3

Let A be an array with lower bound 1.

a) What will be the result of the following 2 statements?

         N = +PROTOTYPE(A)
         SEQ(' SWAP(.A<I>, .A<N + 1 - I>) ', .I)

b) Modify the second statement above so that the array A is
actually reversed.


Exercise 4.4

Rewrite STRINGOUT using SEQ.


Exercise 4.5

Assume A is an array of strings having a lower bound of 1. Use
SEQ to find the index of the first element in A which begins
with the character 'M'.


Exercise 4.6

Modify AOPA so that if the value of OP syntactically resembles
an identifier, it is regarded as a binary function.


Exercise 4.7

Is AOPA(A1,,A2) a valid call? If so, what does it do?


Exercise 4.8

Write a function OPA(OP,A) which will apply the unary operator
OP to every element of the array A.


Exercise 4.9

Write BLEND(X,Y) where X and Y are equi-length strings by an
expression involving functions defined in this chapter.


Exercise 4.10

Extend AI to permit I to range over a) 2-dimensional arrays,
b) multidimensional arrays, and c) programmer-defined data
objects.


Exercise 4.11

The statement

         &ALPHABET BREAK(S) LEN(1) . T

will assign to T the character in S lowest in the alphabet.
Do the same using FIND and other functions defined in this
chapter.


Exercise 4.12

In TRUNC, the statement L = L - 1 could be removed if the
subsequent statement were modified. What modification is
needed? Why was it not done this way?


Exercise 4.13

Write a function DO(S,N,L,U,I) where S is a statement
sequence, N is a name, L is a lower bound, U is an upper
bound, and I is an increment. DO should simulate a Fortran
DO-loop.


Exercise 4.14

(a) Define a function LBOUNDS(A) which will return an array
equal to the sequence of lower bounds of the array A. Define a
function UBOUNDS(A) to do a similar thing with upper bounds.
For example, LBOUNDS(ARRAY('3:10,-1:1')) will return an array
containing two integers, 3 and -1.

(b) Write a function INCREMENT(S,L,U,N) which will increment
and return a sequence of subscripts contained in the array S.
L is an array of lower bounds as might be obtained from the
LBOUNDS function of the previous exercise and U is an array of
upper bounds. N is the size of each of these arrays. The
function should fail if no more increments remain.

(c) Using the functions INCREMENT, LBOUNDS, UBOUNDS defined
above, write a program to print out every item in an array A.
A may have any prototype but all of its items may be assumed
to be printable.


Exercise 4.15

Write a function called PUSH(A,E) which will push an element E
onto an array A which is acting like a stack. The first
element of A contains the index of the last element pushed. If
A runs out of room, double its size. PUSH will return A or the
newly created array. Routines in this section may be used if
applicable.


The SNOBOL series of programming languages through SNOBOL3
had only one datatype, the string. Even the arithmetic
facilities of SNOBOL3 were implemented as operations on
strings of digits rather than on machine integers. Because of
this historical bias, and because the language is
extraordinarily rich in string handling, SNOBOL4 is still
regarded by some as exclusively a string language. Yet, all
the basic facilities which one expects in a list processing
language have been incorporated into SNOBOL4; these include
the automatic allocation and freeing of storage, recursive
functions, the pointer, and the data structure. Moreover, the
notation is, for the most part, conventional, convenient and
flexible. Were SNOBOL4 suddenly stripped of all its pattern
matching capabilities, it would still be a powerful and
convenient list-processing language.


What do we mean by list processing? This is the kind of data 
processing in which associated data is linked together via 
pointers as opposed to an array organization in which as- 
sociated data is placed in consecutive locations. List 
processing is used whenever the association of data is likely 
to change because such change can be readily accomplished 
merely by modifying links rather than by moving data. 


A list is technically a sequence of items joined together by 
pointers and is really just a special case of an arbitrary 
linked structure. Hence 'list processing' is a misnomer for 
what might be better termed 'link processing'. However, a list 
may contain items of any kind, including other lists so that 
arbitrary trees may be formed. Hence, a list is more general 
than what is at first blush indicated. Nonetheless, it is
important to realize that by list processing we mean, really, an
arbitrarily interlaced collection of data objects with the
possibility of loops and with no restrictions on the number of
nodes or the number of links per node. In other words we are 
really speaking of arbitrary graphs. 


The method by which one does list-processing in SNOBOL4 is via 
the so-called programmer-defined datatype. Calling the func- 
tion DATA, one can define a new datatype. Instances of this 
datatype can be created by making what appear to be function 
calls to the name of the datatype. Thus,


         DATA('LINK(NEXT,VALUE)')
         L = LINK('XYZ', 22)


will first define a datatype called LINK and then assign to L 
an object whose 2 fields (viz. NEXT and VALUE) are initialized 
with the 2 values given as arguments. The result is shown in 
Figure 5.1. 


[Figure 5.1: L points to a LINK structure whose NEXT field
points to the string 'XYZ' and whose VALUE field holds the
integer 22.]


For convenience we will refer to data objects of this kind as
structures and to an interlaced set of structures as a data
configuration. Like arrays, structures consist of a sequence
of variables (one created variable for each field) together
with some miscellaneous information denoted by cross hatching
in the figure. These fields may be referenced via function
notation such as


         NEXT(L) = 'ABC'
         N = VALUE(L) + 3


Such field references may be used wherever a variable may be 
used, such as on the left hand side of an assignment (as 
above) or on the right hand side of a variable association
operator (binary . or $). As in the case of all variables,
the field of a structure may be assigned a data object of any 
type, including another structure. Thus 


NEXT(L) = LINK() 
will allocate a new LINK structure and assign it to the NEXT 
field of L. This statement will result in the configuration 
shown in Figure 5.2. 
A field of a structure may refer to the structure in which it 
is embedded or to any part of the configuration. Thus, 
continuing 

NEXT(NEXT(L)) = L 


will produce the configuration shown in Figure 5.3. 


[Figure 5.2: L points to a LINK structure whose VALUE field
holds 22 and whose NEXT field points to a second LINK
structure with null NEXT and VALUE fields.]


There is no intrinsic limit to the number of fields of a 
structure or to the number of new datatypes that may be 
created. 
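
The sequence of statements above can be mirrored in Python as a rough illustration. This is a sketch under stated assumptions, not SNOBOL4: the class Link is a hypothetical stand-in for the LINK datatype, attribute access replaces the field functions, and None stands in for the null string.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(eq=False)
class Link:
    """Stand-in for DATA('LINK(NEXT,VALUE)'); fields default to null."""
    next: Any = None
    value: Any = None


l = Link("XYZ", 22)      # L = LINK('XYZ', 22)
l.next = "ABC"           # NEXT(L) = 'ABC'
n = l.value + 3          # N = VALUE(L) + 3
l.next = Link()          # NEXT(L) = LINK()
l.next.next = l          # NEXT(NEXT(L)) = L, closing the loop
print(n, l.next.next is l)
```

The last assignment produces a configuration with a loop, which is why the routines later in this chapter must take care not to follow pointers blindly.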


[Figure 5.3: as Figure 5.2, except that the NEXT field of the
second LINK structure points back to the first, forming a
loop.]


It is sometimes required that we obtain a pointer to one of
the fields of a structure. This we may do by use of the unary
name operator. Thus


         L = LINK()
         ALPHA = .NEXT(L)


will result in the configuration shown in Figure 5.4. 


[Figure 5.4: L points to a LINK structure with null NEXT and
VALUE fields; ALPHA, of datatype N, points to the NEXT field
of that structure.]


The datatype indicated for ALPHA is 'N' for NAME. We may
assign any value to the variable whose name ALPHA contains, by
using the unary $ operator. For example:

         $ALPHA = LINK()


will result in the configuration shown in Figure 5.5. 
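
Python has no direct counterpart of a SNOBOL4 name, so the following is only an analogy. It assumes a hypothetical class Name that records which field of which object is meant; its set method plays the part of assignment through the unary $ operator.

```python
class Link:
    def __init__(self, next=None, value=None):
        self.next = next
        self.value = value


class Name:
    """Hypothetical stand-in for a SNOBOL4 NAME: a reference to one
    field of one structure, playing the role of .NEXT(L)."""

    def __init__(self, obj, field):
        self.obj = obj
        self.field = field

    def get(self):
        return getattr(self.obj, self.field)

    def set(self, value):
        # Plays the role of  $ALPHA = value
        setattr(self.obj, self.field, value)


l = Link()                 # L = LINK()
alpha = Name(l, "next")    # ALPHA = .NEXT(L)
alpha.set(Link())          # $ALPHA = LINK()
print(isinstance(l.next, Link))
```

The assignment through alpha changes the structure that L points to, even though L itself is never mentioned in that statement.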


[Figure 5.5: the NEXT field of the structure L, named by
ALPHA, now points to a new LINK structure with null NEXT and
VALUE fields.]


Two different datatypes may have the same field without fear
of collision. Thus


         DATA('TN(VALUE,NEXT,LSON,RSON)')


will define a new kind of data called TN (for Tree Node).
Executing


         T = TN(16, LINK())
         NEXT(NEXT(T)) = .T


will result in the structure shown in Figure 5.6.


[Figure 5.6: T points to a TN structure with VALUE 16, null
LSON and RSON fields, and a NEXT field pointing to a LINK
structure; the NEXT field of that LINK, of datatype N, holds
the name of the variable T.]


Program 5.1  READL

The function READL(P) will read in a sequence of items,
placing them in a list, and return the head of the list. P is
a pattern to indicate the end of the list. If P is null (or
equivalently, absent) the list is read in until an end-of-file
condition is encountered. Otherwise, it will stop reading when
the pattern match succeeds. It will not include the card
matched. Thus READL(POS(0) 'STOP') will read a sequence of
strings up to but not including the first string having the
word 'STOP' in column 1.


         DEFINE('READL(P)N,S')
         DATA('LINK(NEXT,VALUE)')                     :(READL_END)
*
*  Entry point: If P is null, make sure the pattern will
*  fail.
*
READL    P = IDENT(P) ABORT
*
*  N will be the name of the variable to receive the next
*  LINK of the list. Initialize it to point to READL.
*
         N = .READL
*
*  Top of loop: Read a card; try the pattern; append the
*  LINK; and update N.
*
READL_1  S = INPUT                                    :F(RETURN)
         S P                                          :S(RETURN)
         $N = LINK(,S)
         N = .NEXT($N)                                :(READL_1)
READL_END
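
The idea of always knowing where the next LINK is to be attached can be sketched in Python. This is an illustration only, with several assumptions: a dummy head cell is swapped in for the name variable N, an iterable of strings stands in for INPUT, and an optional compiled regular expression stands in for the pattern P. The names readl and to_list are hypothetical.

```python
import re


class Link:
    def __init__(self, next=None, value=None):
        self.next = next
        self.value = value


def readl(lines, stop=None):
    """Sketch of READL: build a list in reading order."""
    head = Link()          # dummy cell; head.next plays the role of READL
    tail = head            # the cell whose next field N currently names
    for s in lines:
        if stop is not None and stop.search(s):
            break          # the matched card is not included
        tail.next = Link(None, s)   # $N = LINK(,S)
        tail = tail.next            # N = .NEXT($N)
    return head.next


def to_list(link):
    """Collect the VALUE fields of a chain, for display."""
    out = []
    while link is not None:
        out.append(link.value)
        link = link.next
    return out


cards = ["A", "B", "STOP HERE", "C"]
print(to_list(readl(cards, re.compile(r"^STOP"))))
```

The dummy cell removes the special case of the first item, which is the same service the name of READL performs in the program above.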


Program 5.2  READRL

READRL(P) will read a list in reverse. That is, the head of
the returned list will contain the last string read. The
reversed read is curiously easier to write (and keypunch)
than READL and appears to be a more natural way of appending
items onto a list.


         DEFINE('READRL(P)S')
         DATA('LINK(NEXT,VALUE)')                     :(READRL_END)
*
*  Entry point: Set P; go through the loop inserting the
*  latest LINK onto the front of the list.
*
READRL   P = IDENT(P) ABORT
READRL_1 S = INPUT                                    :F(RETURN)
         S P                                          :S(RETURN)
         READRL = LINK(READRL,S)                      :(READRL_1)
READRL_END


Program 5.3  REVL

REVL(L) will reverse a list L. The algorithm works according
to the diagram in Figure 5.7. For simplicity the list elements
have been denoted by a single cell. Also, an arrow impinging
onto the outline of a cell represents a pointer to the data
object and not a pointer to any particular field within the
data object. REVL and L work their way down the list with L
leading the way and REVL right behind. At each step the NEXT
field of L is made to point backward to the value of REVL and
then the 2 variables are incremented, so that they always span
the 'gap' in the chain of links.


         DEFINE('REVL(L)T')
         DATA('LINK(NEXT,VALUE)')                     :(REVL_END)
*
*  Entry point: Return L if it is not a link. Otherwise,
*  initialize REVL and L to span the gap between the first
*  link and the rest of the list.
*
REVL     REVL = L
         IDENT(DATATYPE(L),'LINK')                    :F(RETURN)
         L = NEXT(REVL)
         NEXT(REVL) =
*
*  Go through loop making NEXT(L) point backward to REVL and
*  walk one step forward (T is a temporary to hold NEXT(L)).
*  Quit when L becomes null.
*
REVL_1   IDENT(L)                                     :S(RETURN)
         T = NEXT(L)
         NEXT(L) = REVL
         REVL = L
         L = T                                        :(REVL_1)
REVL_END
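
The walk performed by REVL does not depend on SNOBOL4. As a hedged illustration, the same steps can be written in Python; the class Link and the helpers from_list and to_list are assumptions of this sketch, and None stands in for the null string.

```python
class Link:
    def __init__(self, next=None, value=None):
        self.next = next
        self.value = value


def from_list(items):
    """Build a chain of Links from a Python list."""
    head = None
    for x in reversed(items):
        head = Link(head, x)
    return head


def to_list(link):
    out = []
    while link is not None:
        out.append(link.value)
        link = link.next
    return out


def revl(l):
    """Reverse a chain of Links in place and return the new head."""
    if not isinstance(l, Link):
        return l               # not a link: return it unchanged
    rev = l
    l = rev.next
    rev.next = None            # the old head becomes the tail
    while l is not None:
        t = l.next             # temporary holding NEXT(L)
        l.next = rev           # point backward across the gap
        rev = l                # step both variables forward
        l = t
    return rev


print(to_list(revl(from_list([1, 2, 3]))))
```

No new cells are allocated; only the NEXT pointers are turned around, one per step.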

[Figure 5.7: successive stages of REVL, in which REVL and L
span the gap in the chain of links and the pointers behind
REVL have been turned backward.]


Program 5.4  LAST

LAST(L) will return (by name) the name of the last NEXT field
of a list. Thus, if L1 and L2 are lists


         LAST(L1) = L2


will concatenate the two lists. If the argument to LAST is
null the function fails. Thus


         LAST(L1) = L2                                :S(LAB1)
         L1 = L2
LAB1


will concatenate L2 to L1 even if one or both of the lists are
null. Also,


         LAST(L) = L


creates a circular list.


         DEFINE('LAST(L)')                            :(LAST_END)
*
*  Entry point: if L is null, fail.
*
LAST     IDENT(L)                                     :S(FRETURN)
*
*  Seek a null NEXT field.
*
LAST_1   L = DIFFER(NEXT(L)) NEXT(L)                  :S(LAST_1)
*
*  Return the name of this field by name.
*
         LAST = .NEXT(L)                              :(NRETURN)
LAST_END
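
The uses of LAST shown above can be imitated in Python, with two simplifications that should be kept in mind: the sketch returns the final cell itself, not the name of its NEXT field, and it returns None where LAST would fail. The names last, concat and from_list are hypothetical.

```python
class Link:
    def __init__(self, next=None, value=None):
        self.next = next
        self.value = value


def from_list(items):
    head = None
    for x in reversed(items):
        head = Link(head, x)
    return head


def last(l):
    """Return the final cell of the chain, or None if l is null."""
    if l is None:
        return None
    while l.next is not None:   # seek a null NEXT field
        l = l.next
    return l


def concat(l1, l2):
    """Mirror of:  LAST(L1) = L2 :S(LAB1) ;  L1 = L2"""
    cell = last(l1)
    if cell is None:
        return l2               # L1 was null
    cell.next = l2
    return l1


joined = concat(from_list([1, 2]), from_list([3]))
ring = from_list(["A", "B"])
last(ring).next = ring          # LAST(L) = L makes the list circular
print(joined.next.next.value, ring.next.next is ring)
```

Once the list is circular, last would loop forever on it, just as LAST would.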


Programs 5.5, 5.6 & 5.7  PUSH, POP & TOP

These routines are stack manipulation routines. As their names
suggest, PUSH and POP are used to respectively put on and take
off an item from a stack. TOP is used to examine the last
element of a stack without modifying it. Thus


         PUSH('ABC') ; PUSH(3)


will push 2 items onto a stack.


         K1 = POP() ; K2 = TOP()
         K3 = POP() ; K4 = TOP()


will assign to K1 the value 3, to K2 the value 'ABC', to K3
the value 'ABC' and will not modify K4 as the calls to TOP and
POP fail when the stack is empty. As an added bonus, TOP and
POP will return by name. In the case of TOP, this means that
values can be assigned into the top element. For example,


         TOP() = 'XYZ'


will change the value at the top of the stack. PUSH returns
the item last pushed (by name). Hence,


         PUSH() = S


has the same effect as PUSH(S). Having been written in this
way, PUSH can be used to push matched substrings of a pattern
match onto a stack. For example,


         S P1 . *PUSH() P2 . *PUSH()


is a pattern matching statement which, if the match succeeds,
causes two substrings to be pushed onto the stack. We will
require this property of PUSH in the chapter on compiling. See
L_ONE, Prog. 18.2. 


         DEFINE('PUSH(X)')
         DEFINE('POP()')
         DEFINE('TOP()')
         DATA('LINK(NEXT,VALUE)')                     :(PUSH_END)
*
*  Entry point for PUSH: Just allocate a LINK and put it at
*  the head of the stack pointed to by the global variable
*  PUSH_POP. Then return the VALUE field by name.
*
PUSH     PUSH_POP = LINK(PUSH_POP,X)
         PUSH = .VALUE(PUSH_POP)                      :(NRETURN)
*
*  Entry point for POP: If the global stack is null, fail.
*  Otherwise return the element and pop the stack.
*
POP      IDENT(PUSH_POP)                              :S(FRETURN)
         POP = VALUE(PUSH_POP)
         PUSH_POP = NEXT(PUSH_POP)                    :(RETURN)
*
*  Entry point for TOP: Return name of VALUE field by name.
*  Fail if none exists.
*
TOP      IDENT(PUSH_POP)                              :S(FRETURN)
         TOP = .VALUE(PUSH_POP)                       :(NRETURN)
PUSH_END
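
A linked stack of this kind can be sketched in Python as follows. The sketch makes several assumptions of its own: a class Stack replaces the single global variable PUSH_POP, an exception named Failure stands in for SNOBOL4 failure, and push and top return the cell itself so that its value may be assigned, which only approximates return by name.

```python
class Failure(Exception):
    """Stands in for the failure of a SNOBOL4 function call."""


class Link:
    def __init__(self, next=None, value=None):
        self.next = next
        self.value = value


class Stack:
    def __init__(self):
        self.head = None        # plays the role of PUSH_POP

    def push(self, x=None):
        self.head = Link(self.head, x)
        return self.head        # the caller may assign to .value

    def pop(self):
        if self.head is None:
            raise Failure
        v = self.head.value
        self.head = self.head.next
        return v

    def top(self):
        if self.head is None:
            raise Failure
        return self.head        # .value may be read or assigned


s = Stack()
s.push("ABC")
s.push(3)
k1 = s.pop()                    # removes the 3
k2 = s.top().value              # examines 'ABC' without removing it
s.top().value = "XYZ"           # like  TOP() = 'XYZ'
k3 = s.pop()
s.push().value = "S"            # like  PUSH() = S
print(k1, k2, k3)
```

Pushing and popping touch only the head of the chain, so both take the same time however deep the stack has grown.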

Program 5.8  COPYL

COPYL will copy a list. It makes use of the built-in function
COPY which can be used to copy structures (as well as arrays).
Hence if a list is a chain of LINKs then COPY will be used to
copy each LINK in turn. If it should happen that the VALUE
field of a list points off to some other list, then a
recursive function call is used to copy this subsidiary list.
No difficulty follows from this simple procedure unless the
data configuration has loops. If one of the fields points back
to a node which has already been copied, we need not, and in
fact must not, make a new copy of this node. Hence we must
find a method to indicate which nodes have already been
visited. This problem is not unique to COPYL. It arises
whenever we wish to process every node of a data configuration
with loops. We solve the problem here with tables. Another
method, one involving marking the structure itself, is
described in VISIT, Prog. 5.10.

To avoid marking structures, we keep a list of all items
already copied paired with copied counterparts. This is most
easily done with a SNOBOL4 table. A table is similar to an
array except that the subscripts are not restricted to
integers but may be any value. Thus


         TBL = TABLE(100)
         TBL<X> = Y


will assign the Xth element of TBL the value Y, no matter what
the datatypes of X and Y are. The value of 100 is an estimate
of the number of items to be placed into the table. Thus, a
table is a kind of associative array. It is implemented as a
collection of descriptor pairs. When items are entered or
extracted, a search must be made for the subscript. In SPITBOL
the value is hashed so that the search is fairly rapid. In
MAINBOL the search is linear but is not all that slow because
only descriptors need be compared. In both languages the
search is quite rapid for small tables.


In our particular application we are interested in the case
where X and Y are structures. If L is a LINK then


         TBL<L> = COPY(L)


will associate with that particular LINK a copy of that LINK.
In this way, we not only mark that a LINK has been copied but
we point directly to the copied LINK.


All this suggests allocating a table when COPYL is first
called. But, if COPYL is called recursively, we do not want
to allocate a new table but rather retain the old one. This
can be done in several ways. Two functions may be defined,
COPYL and COPYL_INT. COPYL will receive control from external
sources; COPYL_INT will be called internally and will not
allocate the table.


Another approach, one to be used here, does not require that 
another function be defined. Rather, the COPYL function is 
redefined, by itself, twice, once immediately after receiving 
control, and once immediately before returning. 


*  COPYL(L) will copy a list of LINKs. The configuration may
*  have loops.
*
         DEFINE('COPYL(L)T')
         DATA('LINK(NEXT,VALUE)')                     :(COPYL_END)
*
*  Entry point: Redefine COPYL to have a new entry point and
*  in which T will be treated as global.
*
COPYL    DEFINE('COPYL(L)','COPYL_1')
*
*  Allocate a table and call COPYL. 100 is the estimate of
*  the number of nodes in the list.
*
         T = TABLE(100)
         COPYL = COPYL(L)
*
*  We are done! Redefine COPYL to the original definition
*  and return.
*
         DEFINE('COPYL(L)T')                          :(RETURN)
*
*  Internal entry point: If L is not a link there is no need
*  to copy it. Just return L.
*
COPYL_1  COPYL = L
         IDENT(DATATYPE(L),'LINK')                    :F(RETURN)
*
*  Have we ever copied this LINK before? If we have, just
*  return the copied LINK.
*
         COPYL = T<L>
         DIFFER(COPYL,NULL)                           :S(RETURN)
*
*  Otherwise copy the LINK and indicate this fact in the
*  table.
*
         COPYL = COPY(L)
         T<L> = COPYL
*
*  Now copy the 2 fields.
*
         VALUE(COPYL) = COPYL(VALUE(L))
         NEXT(COPYL) = COPYL(NEXT(L))                 :(RETURN)
COPYL_END
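
The table technique can be tried out in Python, where a dictionary plays the part of the SNOBOL4 table. The sketch below is an illustration under assumptions, not a transcription: an optional argument carries the table through the recursion in place of the redefinition of COPYL, the table is keyed on the identity of each node, and the name copyl is hypothetical.

```python
class Link:
    def __init__(self, next=None, value=None):
        self.next = next
        self.value = value


def copyl(l, table=None):
    """Copy a configuration of Links that may contain loops."""
    if table is None:
        table = {}                 # allocated once, on the outer call
    if not isinstance(l, Link):
        return l                   # not a link: nothing to copy
    if id(l) in table:
        return table[id(l)]        # already copied: reuse that copy
    new = Link(l.next, l.value)    # shallow copy, like COPY(L)
    table[id(l)] = new             # record it before recursing
    new.value = copyl(l.value, table)
    new.next = copyl(l.next, table)
    return new


a = Link(None, 1)
b = Link(a, 2)
a.next = b                         # a and b now form a loop
c = copyl(a)
print(c is not a, c.next.next is c)
```

Entering the copy in the table before the fields are processed is what stops the recursion when a loop leads back to a node already seen. Like the original, the sketch recurses once per LINK, so a very long list may exhaust the recursion stack.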


Program 5.9  FLD

FLD(ST,I) will return (by name) the Ith field of the structure
ST, failing if I exceeds the number of fields in the structure
ST. It is written using 2 built-in functions, APPLY and FIELD.
APPLY may be used with arbitrary function names as well as
with fields of a structure. Note that APPLY returns by name
(where applicable) and also note that FIELD requires a
datatype, not a data object.


         DEFINE('FLD(ST,I)')                          :(FLD_END)
FLD      FLD = .APPLY(FIELD(DATATYPE(ST),I),ST)
+                                       :S(NRETURN)F(FRETURN)
FLD_END
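
Access to a field by position can be imitated in Python with the standard dataclasses module. The sketch is only an analogy: field_name stands in for FIELD, fld and set_fld together stand in for a FLD that returns by name, and the exception Failure stands in for SNOBOL4 failure. All four names are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Any


class Failure(Exception):
    """Stands in for the failure of a SNOBOL4 function call."""


@dataclass(eq=False)
class Link:
    next: Any = None
    value: Any = None


def field_name(datatype, i):
    """Like FIELD: the name of the i-th (1-based) field of a datatype."""
    names = [f.name for f in fields(datatype)]
    if not 1 <= i <= len(names):
        raise Failure
    return names[i - 1]


def fld(st, i):
    """Fetch the i-th field of the structure st."""
    return getattr(st, field_name(type(st), i))


def set_fld(st, i, value):
    """Assign to the i-th field of the structure st."""
    setattr(st, field_name(type(st), i), value)


l = Link("ABC", 22)
set_fld(l, 2, fld(l, 2) + 1)
print(fld(l, 1), fld(l, 2))
```

As with FIELD, the lookup is made on the datatype and not on the object, so the same code serves any structure.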

Program 5.10  VISIT

VISIT will visit every structure of a configuration, once and
only once, calling PROCESS(ST) upon arrival, where ST is the
structure visited. PROCESS represents some activity to be
carried out and is left to be defined by the user.


COPYL, in the process of copying a configuration, had to visit
every node and we could let that function serve as a model
from which to write VISIT. The only basic difference would be
that, in COPYL, we knew the kind of structures we were dealing
with and so we could reference the fields by name. In VISIT,
the structures are arbitrary and so we must use a function
such as FLD to sequence through every field.


But we will depart from the COPYL method in two other ways.
In the first place, we would like to present a method which
avoids recursion. In many languages recursion is either
unavailable or inefficient. Also, recursion, if carried to
too many levels, will result in stack overflow. Also, we would
like to present a method of marking structures which does not
depend on tables.


The algorithm, to be presented, was discovered independently 
in 1965 by Deutsch and Schorr and Waite; see Knuth [Vol. 1,
pp. 416-417]. It was developed in connection with garbage
collection. One phase of garbage collection is the marking phase
when every structure which can be accessed is marked. Subse- 
quent phases insure that the marked structures are saved and 
the unmarked structures discarded. Avoiding recursion when 
garbage collecting is highly desirable if the recursion stack 
is sharing collectable storage. 


The algorithm works as follows. SON initially points to the 
root node of a tree as indicated in Figure 5.8(a), and the 
node is marked with a 1 (also shown in the figure). All poin- 
ters in the structure are examined to see if they point off to 
any as-yet-unmarked structure. If an unmarked structure is 
found, it is regarded as the new SON and the old son becomes 
the FATHER. If, in the new son, there is a pointer off to an 
unmarked node, the SON and FATHER descend another level. The 
pointer which had been used to point downward in the tree is 
redirected upward so that it is possible to determine from 
whence we came. The situation is depicted in Figure 5.8(b). 
Note that FATHER and SON span a 'gap' in the structure created 
by our backward pointer. This is similar to REVL. 


The backward pointers permit us to crawl back up the tree when 
we are through examining all the descendants of SON. The MARK 
serves also the purpose of denoting which field is being used 
as backward pointer. For example, Figure 5.8(c) shows the 
situation a little later in which a mark of 2 on the
grandfather indicates that the 2nd field is pointing to the
great-grandfather.


When we are done, all the marks will have been set positive. 
We cannot make all the marks 0 again using our VISIT function 
but we can make them all negative by setting SIGN = -1. VISIT 
will work properly if the initial value of the marks is < 0 so 
that this procedure can be used to restore the state of the 
configuration to one which will accept subsequent VISITs. 


We could use a table to record the marks, as we did with 
COPYL. However, a more efficient method would be to add a MARK 
field to each data structure. For example, to add a MARK field 
to the LINK data type we could execute 


         DATA('LINK(NEXT,VALUE,MARK)')


It is rather remarkable that we may substitute this DATA call
for the DATA call


         DATA('LINK(NEXT,VALUE)')


in just about any program without modifying its behaviour. But
it is at least inelegant, and perhaps impractical, to request
users of VISIT to add a MARK field to every structure. Hence
we will do this for him by redefining the DATA function. The
new data function will capture control of each call to DATA,
insert a MARK field, and then call the old original DATA
function.


[Figure 5.8: (a) SON points to the root node, which is marked
1; (b) FATHER and SON after descending one level, with the
pointer formerly leading downward redirected upward; (c) a
later stage, in which a mark of 2 on the grandfather shows
that its 2nd field points to the great-grandfather.]


If the user is using the FIELD function, as we do in FLD, he 
may inadvertently sequence into the MARK field which is sup- 
posed to be kept invisible. But we can keep him out of the 
MARK field by redefining the FIELD function. 


*  VISIT(ST) will visit every node of the configuration
*  headed by structure ST. Visitation consists of calling
*  PROCESS(ND) where ND is the node. VISIT(ST,-1) will reset
*  the marks.
*
         DEFINE('VISIT(SON,SIGN)FATHER,GS,GF,DT,I')
*
*  Redefine the DATA function so that a MARK field is
*  inserted into each new datatype.
*
         OPSYN('OLD_DATA','DATA')
         DEFINE('DATA(S)')                            :(DATA_END)
DATA     S ')' = ',MARK)'
         OLD_DATA(S)                                  :(RETURN)
DATA_END
*
*  Redefine the FIELD function so that the user won't know
*  about the MARK field.
*
         OPSYN('OLD_FIELD','FIELD')
         DEFINE('FIELD(DT,I)')                        :(FIELD_END)
FIELD    OLD_FIELD(DT,I + 1)                          :F(FRETURN)
         FIELD = OLD_FIELD(DT,I)               :S(RETURN)F(FRETURN)
FIELD_END
*
*  Initialization section for VISIT: STND_DT will match a
*  standard datatype.
*
         STND_DT = POS(0) ('STRING' | 'INTEGER' | 'REAL'
+           | 'PATTERN' | 'ARRAY' | 'TABLE' | 'NAME' |
+           'EXPRESSION' | 'CODE' | 'EXTERNAL') RPOS(0)
+                                                     :(VISIT_END)


*  Entry point for VISIT: The default value for SIGN is 1.
*  If the datatype of the node is standard (i.e. not
*  programmer-defined), just return.
*
VISIT    SIGN = EQ(SIGN,0) 1
         DATATYPE(SON) STND_DT                        :S(RETURN)
*
*  Control flows to VISIT_2 whenever a previously unmarked
*  SON is found. Here it is processed and marked and I is
*  initialized.
*
VISIT_2  PROCESS(SON)
         MARK(SON) = SIGN
         I = 0
*
*  Examine the Ith node of SON (GS means grandson). If GS is
*  an unmarked structure, fall through. Else, loop. If no
*  more grandsons remain, go to VISIT_3.
*
VISIT_1  I = I + 1
         GS = FLD(SON,I)                              :F(VISIT_3)
         DATATYPE(GS) STND_DT                         :S(VISIT_1)
         GT(SIGN * MARK(GS),0)                        :S(VISIT_1)
*
*  Mark the SON with the current value of I so we can pick up
*  later where we left off. Point back to FATHER rather than
*  forward to GS.
*
         MARK(SON) = SIGN * I
         FLD(SON,I) = FATHER
*
*  Descend down one level; then go back to PROCESS and MARK
*  the new SON.
*
         FATHER = SON
         SON = GS                                     :(VISIT_2)
*
*  Here if no grandsons are left. If FATHER is null we are
*  done. Otherwise set GF to be the grandfather.
*
VISIT_3  IDENT(FATHER)                                :S(RETURN)
         I = SIGN * MARK(FATHER)
         GF = FLD(FATHER,I)
*
*  Point back toward the SON. Then hoist up one level.
*
         FLD(FATHER,I) = SON
         SON = FATHER
         FATHER = GF                                  :(VISIT_1)
VISIT_END
Names referenced by VISIT:

   Name    Type        Where defined
   FLD     Function    Program 5.9


Exercise 5.1   Rewrite CRACK(S,C) (Prog. 4.1) to return a
linked list of strings rather than an array of strings.


Exercise 5.2   A doubly-linked list is one in which, in addition
to a NEXT field pointing to the next item on the list, there is
a PREV field pointing to the previous item on the list. Let L be
an item of such a list. Write code to remove the item from its
list.


Exercise 5.3   Write a routine FIRST() which will remove (and
return) the first item on the push-down stack maintained by PUSH
and POP and fail if no such item exists. Do this (a) without
modifying PUSH and POP and (b) by modifying PUSH so that the
process of getting the first element is more efficient.


Exercise 5.4   Modify COPYL so that it copies a configuration
composed of structures of arbitrary types.


Exercise 5.5   As indicated in the text, the assignment
LAST(L) = L will create a circular list. What modification to
REVL (Prog. 5.3) is required to reverse a circular list (the
node returned should be the node originally given)?


Exercise 5.6   Write a routine DISPLAY(L) which will display a
data configuration headed by L. The type of structures in the
configuration may be dissimilar and arbitrary.


Exercise 5.7   Write a function called MIFFLD(N,S) which will
serve as a predicate to determine whether N is the name of a
field of the structure S. The body of the function requires two
statements.


Exercise 5.8   Modify DATA and FIELD (subfunctions of VISIT,
Prog. 5.10) so that every structure created will have not one
but two additional fields MARK and THREAD. Moreover, arrange to
seize control at each request to allocate a new structure so
that all structures will be threaded together via the THREAD
field. Rewrite VISIT so that by chaining down the THREAD field,
the MARK field of each structure is initially set to 0.


Exercise 5.9   How would you modify VISIT (Prog. 5.10) in order
to copy an arbitrary configuration? (Hint: Add a field called
NEW to every structure which will point to the copied version.)


Exercise 5.10   Two configurations are said to be isomorphic if
there is a one-one correspondence between the structures of the
configurations such that if two structures correspond (a) they
have the same type, (b) any field of one structure that does not
have a structure as value must equal the corresponding field of
the other, and (c) if a field of one has a structure S as value
then the field of the other must have a structure S' such that S
corresponds with S'. Write a subroutine ISO(S1,S2) which will
succeed if structures S1 and S2 correspond in an isomorphic
configuration.


What is a pattern? We have used patterns throughout the
preceding sections of this book without consciously evoking this
question. Indeed it is perhaps not strictly necessary to know
what patterns are so long as one knows how they work and what
they do. However, patterns play such an important role in
SNOBOL4 programming and they provide such a powerful facility
for analyzing input data strings that a strong conceptual
framework becomes necessary in order to derive clean and
efficient implementations, resolve complex and seemingly
ambiguous issues and contrive reasonable extensions.


It is tempting to suggest that a pattern is a set of strings.
Thus

      P  =  'AB' | 'A'

would identify P as the two strings 'AB' and 'A'. Continuing
in this vein


P = LEN(3) 


would be the set of all strings consisting of three characters 
and 


P = ARBNO(ANY('AB')) 


would be the set of all strings (including the null string)
comprised of characters chosen from the set {A,B}. FAIL, of
course, would be the empty set.


But what would we make of the patterns POS(n), RPOS(n),
TAB(n), RTAB(n), BREAK(s), SPAN(s), FENCE, and ABORT, which
cannot be uniquely identified with a set of strings? POS(n)
matches the null string when it matches, but it doesn't match
all null strings, only those at position n. If we identified
POS(0) with the null string, we would be forced to conclude
that POS(0) = POS(1), which is nonsense. By a similar token,
BREAK(s), when it matches, will match a string not containing a
character of s, but it cannot be said to match all such
strings, only those followed by a character of s. Hence,
although BREAK(s) can match a null string on occasion, it
cannot be related uniquely to the null string. The strings that
BREAK(s) matches are determined in part by the context in which
the strings are embedded, and this is true of most of the
patterns which cannot be related to string sets.


Another difference between patterns and sets of strings is
that a pattern, if it matches more than one string, expresses
a preference between any two. Thus

      'AB' | 'A'

implies that 'AB' is tried before 'A' and behaves differently
from

      'A' | 'AB'


PATTERNS AND CURSORS

Patterns are more accurately thought of as recognition
processes operating on cursors. A cursor is a pair (S,I) where
S is a string called the subject and I is an integer marking a
position in the subject. I is called the cursor position. A
cursor points between characters (as opposed to at them) and
therefore the cursor position ranges between 0 and the length
of the subject inclusive. The cursor ('ABCDEF',2) is depicted
in Figure 6.1.


      +---+---+---+---+---+---+
      | A | B | C | D | E | F |
      +---+---+---+---+---+---+
              ^

Figure 6.1   A depiction of the cursor ('ABCDEF',2)


When a pattern is called upon to match, it is presented with a
cursor called the pre-cursor and the pattern either matches or
fails to match at that point. If it matches, there will be a
sequence of one or more post-cursor positions to identify the
portion of the subject matched. A pattern P can then be
defined as a function whose input value is a cursor and whose
output value is a sequence of cursors. For reasons which will
become apparent later we will use backward notation (c)P or
simply cP to represent the application of the pattern P to its
cursor argument c. Hence we write

$$ cP = [c_1, c_2, \ldots] $$

We will use square brackets as above to represent sequences,
reserving braces to represent sets and parentheses for other
kinds of scope delimitation.


For example, if the pattern ('CDE' | 'C') is applied to the
cursor position of Figure 6.1 we have

      ('ABCDEF',2) ('CDE' | 'C')  =  [5, 3]

In the above, the cursor position 5 stands as an abbreviation
for the cursor ('ABCDEF',5) and similarly 3 is an abbreviation
for ('ABCDEF',3). This presents no ambiguity since the subject
does not change during a match.

We will use $\varnothing$ to represent the null sequence. Thus

      ('ABCDEF',1) ('CDE' | 'C')  =  $\varnothing$

Two patterns are equal if they represent the same function.
That is, if $(c)P_1 = (c)P_2$ for all c then $P_1 = P_2$.


Below are some examples of built-in patterns in SNOBOL4. L is
the length of the subject string. When a cursor is used in an
arithmetic context it is the cursor position that is implied.
For simplicity, the sequence [c] is represented as simply c.

$$
\begin{aligned}
c\,\mathrm{POS}(n)  &= c     && \text{if } n = c       && (\varnothing \text{ otherwise}) \\
c\,\mathrm{RPOS}(n) &= c     && \text{if } n = L - c   && (\varnothing \text{ otherwise}) \\
c\,\mathrm{TAB}(n)  &= n     && \text{if } n \ge c     && (\varnothing \text{ otherwise}) \\
c\,\mathrm{RTAB}(n) &= L - n && \text{if } L - n \ge c && (\varnothing \text{ otherwise}) \\
c\,\mathrm{LEN}(n)  &= c + n && \text{if } c + n \le L && (\varnothing \text{ otherwise})
\end{aligned}
$$

      ('ABCDEF',1) BREAK('TAF')  =  [5]
      ('ABCDEF',2) SPAN('CAT')   =  [3]
      ('A(B())CD',0) BAL         =  [1, 6, 7, 8]
      ('ABCDE',0) ARB            =  [0, 1, 2, 3, 4, 5]

Note that in the above, most built-in patterns have at most
one post-cursor position. ARB and BAL are exceptions and these
are regarded as having 'implicit alternatives'.
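
The table above can be tried out in a small model. The following Python sketch is not part of SNOBOL4 or of the text; it assumes a pattern is represented as a function taking a subject string and a cursor position and returning the list of post-cursor positions. The upper-bound check in TAB and the balance-tracking loop in BAL are assumptions made so that the model stays within the subject.

```python
# Illustrative model: a pattern is a function (subject, cursor) -> list of
# post-cursor positions. An empty list stands for the null sequence.

def POS(n):
    return lambda s, c: [c] if c == n else []

def RPOS(n):
    return lambda s, c: [c] if len(s) - c == n else []

def TAB(n):
    # Assumption: n may not lie beyond the end of the subject.
    return lambda s, c: [n] if c <= n <= len(s) else []

def RTAB(n):
    return lambda s, c: [len(s) - n] if c <= len(s) - n else []

def LEN(n):
    return lambda s, c: [c + n] if c + n <= len(s) else []

def BREAK(chars):
    def match(s, c):
        # Advance to the first character of `chars`; fail if none follows.
        for i in range(c, len(s)):
            if s[i] in chars:
                return [i]
        return []
    return match

def SPAN(chars):
    def match(s, c):
        i = c
        while i < len(s) and s[i] in chars:
            i += 1
        return [i] if i > c else []   # must match at least one character
    return match

def ARB(s, c):
    # Implicit alternatives: every position from c to the end, shortest first.
    return list(range(c, len(s) + 1))

def BAL(s, c):
    # Implicit alternatives: the end of each nonnull balanced prefix.
    out, depth = [], 0
    for i in range(c, len(s)):
        if s[i] == '(':
            depth += 1
        elif s[i] == ')':
            depth -= 1
            if depth < 0:
                break
        if depth == 0:
            out.append(i + 1)
    return out

print(BREAK('TAF')('ABCDEF', 1))   # [5]
print(BAL('A(B())CD', 0))          # [1, 6, 7, 8]
```

Only ARB and BAL return more than one position here, which is the sense in which they carry implicit alternatives.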


Unevaluated expressions within patterns may make their 
behavior vary during a match. Thus 


      P  =  BREAK(*S)

will succeed or fail depending on the value of S. Any such
pattern is termed varying. For the duration of this chapter
we will only be concerned with nonvarying patterns.

The alternation (|) of two patterns is defined as:

$$ c\,(P_1 \mid P_2) = (cP_1)(cP_2) \qquad (6.1) $$


where the right hand side indicates the concatenation of the
two sequences.

To define the concatenation of patterns we must extend the
definition of pattern to operate on sequences of cursor
positions. This is easily done:

$$ [c_1, c_2, \ldots]\,P = (c_1 P)(c_2 P)\cdots \qquad (6.2) $$

Note that the notation $c_1 P c_2 P$ is ambiguous because it can
mean either $((c_1 P)c_2)P$ or $(c_1 P)(c_2 P)$ and so will be
avoided. For completeness

$$ \varnothing P = \varnothing $$

Pattern concatenation is defined as

$$ c\,(P_1 P_2) = (cP_1)P_2 \qquad (6.3) $$

For example

      ('ABCDEF',2) (('CDE' | 'C') LEN(1))  =  [5, 3] LEN(1)
                                           =  (5 LEN(1)) (3 LEN(1))
                                           =  [6, 4]
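
Definitions (6.1) to (6.3) translate almost word for word into code. The Python sketch below is an illustration only, under the assumption that a pattern is a function from (subject, cursor) to a list of post-cursors; the helper names lit, alt, apply_seq and cat are hypothetical and do not come from the text.

```python
def lit(text):
    # A string literal matches itself at the cursor.
    return lambda s, c: [c + len(text)] if s.startswith(text, c) else []

def LEN(n):
    return lambda s, c: [c + n] if c + n <= len(s) else []

def alt(p, q):
    # (6.1): c(P1 | P2) = (c P1)(c P2), list concatenation.
    return lambda s, c: p(s, c) + q(s, c)

def apply_seq(p, s, cursors):
    # (6.2): [c1, c2, ...] P = (c1 P)(c2 P)...
    return [d for c in cursors for d in p(s, c)]

def cat(p, q):
    # (6.3): c(P1 P2) = (c P1) P2
    return lambda s, c: apply_seq(q, s, p(s, c))

either = alt(lit('CDE'), lit('C'))
print(either('ABCDEF', 2))               # [5, 3]
print(either('ABCDEF', 1))               # []
print(cat(either, LEN(1))('ABCDEF', 2))  # [6, 4]
```

The order of the result list carries the preference between alternatives, which a set of strings could not express.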


The pattern FAIL is defined as:

$$ (c)\,\mathrm{FAIL} = \varnothing $$

for all c. Hence

      FAIL | P  =  P  =  P | FAIL

for all P. That is, FAIL is the identity element under pattern
alternation. Note that

      (c)NULL  =  c

where NULL is the null string. This is the identity mapping
for cursors and hence NULL is the identity element for pattern
concatenation. That is

      NULL P  =  P  =  P NULL

for all patterns P.


A pattern may have a countably infinite number of post-cursor
positions. For example:

$$ (c)\,\mathrm{SUCCEED} = [c, c, c, \ldots] $$

where the sequence goes on indefinitely. An infinitude of
alternates, therefore, produces a well-defined pattern. Thus

      ARB  =  (NULL | LEN(1) | LEN(2) | ...)

may be regarded as a proper definition for ARB. Although the
number of post-cursor positions of (c)ARB is bounded by the
length of the subject and so is always finite, finiteness is
not in general a requirement for a pattern to be well-defined.
A pattern whose sequence of post-cursors is finite for all
pre-cursors is said to be finite. If there is at least one
pre-cursor such that the list of post-cursors is infinite the
pattern is said to be infinite. As usual, we will hold that if
C is infinite then

$$ C = C\,C' $$

for all sequences C'. Thus

      SUCCEED  =  SUCCEED | P

for all patterns P.

The definition of pattern should not be thought of as
restricted in any way to those patterns which are directly
available via SNOBOL4 primitives or by combinations of simple
operations such as alternation or concatenation. A pattern is
any well-defined process which maps a cursor into cursors of
the same subject.


NONLINEAR PATTERNS

ABORT is a more pungent form of FAIL. Whereas (c)ABORT, like
(c)FAIL, contains no post-cursor positions (ABORT always
fails), ABORT differs from FAIL in that it causes an immediate
halt of scanning. To include ABORT in the theory it is
necessary to annex a new element which is the value of ABORT.
We write

$$ (c)\,\mathrm{ABORT} = \dagger $$

$\dagger$ is called the abort symbol. When it is concatenated on
the left of any sequence of cursors it yields itself. That is

$$ \dagger\,[c_1, c_2, \ldots] = \dagger $$

More generally, an extended sequence E is defined as

$$ E = C\lambda = [c_1, c_2, \ldots]\,\lambda $$

where C is a sequence of cursor positions, possibly infinite,
possibly null, and $\lambda$ is either $\dagger$ or $\varnothing$.
Concatenation of extended sequences is defined as

$$
(C_1\lambda_1)(C_2\lambda_2) =
\begin{cases}
C_1 C_2 \lambda_2 & \text{if } \lambda_1 = \varnothing \\
C_1 \lambda_1     & \text{if } \lambda_1 = \dagger
\end{cases}
$$

It is easy to see that the concatenation of extended sequences
is associative (the leftmost abort symbol is the important
one no matter how the sequences are grouped) so that

$$ (E_1 E_2) E_3 = E_1 (E_2 E_3) \qquad (6.4) $$

We can extend the domain of patterns from mere sequences to
extended sequences as follows:

$$ (C\lambda)P = (CP)\lambda \qquad (6.5) $$

Note that $(\dagger)P = \dagger$.

An extended sequence which does not have a terminal abort
symbol is called linear; otherwise it is called nonlinear. If
for all cursors c, the value of (c)P is linear then P itself
is said to be linear.

The built-in pattern FENCE which matches the null string but
causes an immediate halt of scanning (like ABORT) when backed
into is defined as

$$ (c)\,\mathrm{FENCE} = [c]\,\dagger $$
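
Extended sequences are easy to experiment with. In the Python sketch below, which is an illustration and not taken from the text, an extended sequence is assumed to be a pair (cursors, aborted) where aborted is True exactly when the sequence ends in the abort symbol. The names econcat and apply_ext are hypothetical.

```python
def econcat(e1, e2):
    # Concatenation of extended sequences: the leftmost abort symbol wins.
    c1, a1 = e1
    c2, a2 = e2
    return (c1, True) if a1 else (c1 + c2, a2)

def apply_ext(p, s, e):
    # (6.5): (C lambda)P = (CP) lambda, where CP follows (6.2).
    cursors, aborted = e
    out = ((), False)
    for c in cursors:
        out = econcat(out, p(s, c))
        if out[1]:            # an abort inside CP hides everything after it
            return out
    return econcat(out, ((), aborted))

NULL = lambda s, c: ((c,), False)
FAIL = lambda s, c: ((), False)
ABORT = lambda s, c: ((), True)
FENCE = lambda s, c: ((c,), True)

def lit(text):
    return lambda s, c: (((c + len(text),), False)
                         if s.startswith(text, c) else ((), False))

def alt(p, q):
    return lambda s, c: econcat(p(s, c), q(s, c))

def cat(p, q):
    return lambda s, c: apply_ext(q, s, p(s, c))

# FENCE behaves like NULL | ABORT at every cursor of this sample subject.
print(all(alt(NULL, ABORT)('AB', c) == FENCE('AB', c) for c in range(3)))
```

Checking associativity (6.4) over a handful of sample sequences is a one-line loop over econcat, and is a useful sanity test of the representation.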
FUNDAMENTAL PROPERTIES

The definitions of concatenation and alternation of patterns
given above, (6.1) and (6.3), are still valid with extended
sequences. It follows immediately from the associativity of
extended sequences that the alternation of patterns is
associative. That is

$$ (P_1 \mid P_2) \mid P_3 = P_1 \mid (P_2 \mid P_3) \qquad (6.6) $$

We briefly introduced the notions of transformations and
homomorphisms on strings in Chapter 3. It readily follows from
(6.2) and (6.5) that patterns are homomorphic transformations
on extended sequences. That is

$$ (E_1 E_2)P = (E_1 P)(E_2 P) \qquad (6.7) $$

From this it follows that

$$ E\,(P_1 P_2) = (E P_1) P_2 \qquad (6.8) $$

Thus, if a pattern is regarded as a transformation on extended
sequences, concatenation becomes function composition. It is
an interesting fact that function composition is always
associative. Thus

$$ (P_1 P_2) P_3 = P_1 (P_2 P_3) \qquad (6.9) $$


Proposition  Concatenation distributes over alternation from
the right. That is

$$ (P_1 \mid P_2) P_3 = P_1 P_3 \mid P_2 P_3 \qquad (6.10) $$

Proof: The left hand side when applied to a cursor c will
produce by (6.1) and (6.7) and (6.1) again

$$ ((cP_1)(cP_2))P_3 = (cP_1P_3)(cP_2P_3) = c\,(P_1P_3 \mid P_2P_3) $$

Note that distribution from the left would depend upon
$E(P_1 \mid P_2) = (EP_1)(EP_2)$ which is not true for arbitrary E.
See Exercise 6.2.


A pattern P is said to be monic if (c)P has at most one
post-cursor. Thus 'A' | 'AB' is not monic but 'A' | 'B' is monic
since both alternands could not match at the same pre-cursor
position. Also, FENCE is monic for although (c)FENCE is
$c\,\dagger$ the abort symbol does not count as a post-cursor
position. Note that if $M_1$ and $M_2$ are monic patterns then so
is their concatenation $(M_1 M_2)$.

Proposition  If m is monic and linear then it distributes over
alternation from the left. That is

$$ m\,(P_1 \mid P_2) = mP_1 \mid mP_2 \qquad (6.11) $$

The proof of this is simple and will be left as an exercise.
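
A small experiment shows why the monic condition matters for left distribution. The Python sketch below is an illustration under the assumption that linear patterns are functions from (subject, cursor) to a list of post-cursors; lit, alt and cat are hypothetical helper names, not SNOBOL4 primitives.

```python
def lit(text):
    return lambda s, c: [c + len(text)] if s.startswith(text, c) else []

def alt(p, q):
    return lambda s, c: p(s, c) + q(s, c)

def cat(p, q):
    return lambda s, c: [d for c1 in p(s, c) for d in q(s, c1)]

subject = 'ABC'
nonmonic = alt(lit('A'), lit('AB'))   # two post-cursors at cursor 0
monic = lit('A')

# Left distribution fails for the nonmonic pattern: the order differs.
joint = cat(nonmonic, alt(lit('C'), lit('B')))(subject, 0)
split = alt(cat(nonmonic, lit('C')), cat(nonmonic, lit('B')))(subject, 0)
print(joint, split)                    # [2, 3] [3, 2]

# With a linear monic pattern on the left, both sides agree, as in (6.11).
joint_m = cat(monic, alt(lit('C'), lit('B')))(subject, 0)
split_m = alt(cat(monic, lit('C')), cat(monic, lit('B')))(subject, 0)
print(joint_m == split_m)              # True

# Right distribution (6.10) holds regardless.
right_l = cat(alt(lit('A'), lit('AB')), lit('C'))(subject, 0)
right_r = alt(cat(lit('A'), lit('C')), cat(lit('AB'), lit('C')))(subject, 0)
print(right_l == right_r)              # True
```

Both sides of the failing case contain the same cursor positions; only the order of preference differs, which is exactly what distinguishes a sequence from a set.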


Most of SNOBOL4's built-in patterns are, as has been
previously noted, monic. The others are referred to as having
implicit alternatives. If a pattern is composed only of monics
then it can be decomposed into an alternation of monics as in
the proposition below. This yields a kind of canonical form
for patterns.

Proposition  Let P be any pattern formed by concatenation and
alternation of linear monic patterns and ABORT and FENCE. Then
P can be written

$$ m_1 A_1 \mid m_2 A_2 \mid \cdots \mid m_n A_n \qquad (6.12) $$

where each $m_i$ is linear monic and where each $A_i$ is either
ABORT or NULL (the null string also serves as the null pattern
and both differ from the null sequence, $\varnothing$).

Proof: By induction. If P has only one element then, since

      FENCE  =  NULL | ABORT

P is of the indicated form. If P is of the form $P_1 \mid P_2$ and
both $P_1$ and $P_2$ are in the form of (6.12), P is also. If P is
of the form $P_1 P_2$ and both are of the form (6.12) we have, by
right distribution

$$ P_1 P_2 = m_1 A_1 P_2 \mid \cdots \mid m_n A_n P_2 $$

Focus on only one term, for if we can show that each term
reduces to (6.12), their alternation will. Consider

$$ m A P_2 $$

If A is ABORT, the value is mA and is of the desired form.
Otherwise apply left distribution of m over $P_2$.


SCANNING

In the normal unanchored mode of scanning the pattern is first
presented with the cursor (Subject,0) and upon failure is
presented with (Subject,1) and so forth until the pattern
succeeds. That is, the effect of a pattern match is the first
cursor position of

$$ (0\,P)(1\,P)\cdots(L\,P) $$

if any. Here L is the length of the subject. The string
matched is determined by the first nonempty (c P). Let
$(c_1 P)$ be the first nonempty one. Let $c_2$ be the first
post-cursor of $(c_1 P)$. Then the string bounded by $c_1$, $c_2$
is the substring matched. For example, let the subject be 'ABC'
and let the pattern be 'AB' | 'C'. Then the sequence

$$ (0\,P)(1\,P)(2\,P)(3\,P) $$

is

$$ [2]\;\varnothing\;[3]\;\varnothing = [2, 3] $$

The first pre-cursor position (0) and the first post-cursor
position (2) determine the string matched ('AB').

If the pattern matcher is in anchored mode then the sequence
of cursor positions of interest is only (0 P).
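
The scanning rule can be sketched in a few lines. The Python below is illustrative only, again assuming that a pattern is a function from (subject, cursor) to a list of post-cursors; the function name scan and its anchored parameter are hypothetical.

```python
def lit(text):
    return lambda s, c: [c + len(text)] if s.startswith(text, c) else []

def alt(p, q):
    return lambda s, c: p(s, c) + q(s, c)

def scan(pattern, subject, anchored=False):
    """Return (pre-cursor, post-cursor) of the match, or None on failure."""
    last_start = 0 if anchored else len(subject)
    for start in range(last_start + 1):
        posts = pattern(subject, start)
        if posts:                      # first nonempty (c P)
            return (start, posts[0])   # its first post-cursor
    return None

P = alt(lit('AB'), lit('C'))
print([P('ABC', c) for c in range(4)])      # [[2], [], [3], []]
print(scan(P, 'ABC'))                       # (0, 2)
print(scan(lit('C'), 'ABC'))                # (2, 3)
print(scan(lit('C'), 'ABC', anchored=True)) # None
```

The substring matched is then subject[start:end], here 'AB' for the first call to scan.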


ARBNO

The function ARBNO(P), which may also be written P*, is
defined as

$$ P^* = \mathrm{NULL} \mid P\,P^* \qquad (6.13) $$

Since P* is defined in terms of itself we may well ask, is it
well-defined? That is, does (6.13) specify one and only one
pattern? The answer, as we will see, is yes, but the question
is at least as intriguing as the answer. Will a pattern, in
general, defined in terms of itself have a unique solution?
The answer is, obviously, no since

      P  =  P

will be satisfied by any pattern. Next, we might consider
patterns having the same general form as (6.13), viz.

$$ P = Q_1 \mid Q_2 P \qquad (6.14) $$

Will this always uniquely define P where $Q_1$ and $Q_2$ are given?
The answer is no, for let $Q_1$ = FAIL and let $Q_2$ = NULL. Then
(6.14) reduces to

      P  =  FAIL | NULL P  =  NULL P  =  P

Here, as before, there are an infinite number of solutions to
the equation. As a less trivial example, let

$$ Q_1 = \mathrm{POS}(0), \qquad Q_2 = \mathrm{POS}(1) $$

Then (6.14) has an infinitude of solutions of the form:

      P  =  POS(0) | POS(1) P'

where P' is any pattern. (Note that POS(i) POS(j) is either
FAIL if the arguments are unequal or POS(i) if i = j.)

For the special case that $Q_1$ is NULL, however, we have the
following

Proposition  The equation

$$ P = \mathrm{NULL} \mid Q\,P \qquad (6.15) $$

can be satisfied by one and only one pattern P.

Proof: We will prove this by providing a procedure for
computing the kth cursor position (if one exists) of (c)P for
all c and for all k. Since (c)NULL = c, the first cursor
position of (c)P is determinable for all c, viz. c itself. This
forms the basis of an inductive proof. Suppose that we can
compute the first k-1 cursor positions of (c)P for all c. In
some cases there may not be as many as k-1 in which case we
would know all of them and also how the sequence terminated
(i.e. with an abort symbol or not). Then to compute the kth
cursor position of (c)P we note that

$$ (c)P = c\,(c\,Q\,P) $$

Letting $(c)Q = [c_1, c_2, \ldots]\,\lambda$ we have

$$ (c)P = c\,(c_1 P)(c_2 P)\cdots\lambda $$

Now all that is needed to compute the kth cursor of (c)P is
to compute the (k-1)st cursor of $(c_1)P$ if it exists. If it
does not and if the sequence is not terminated by an abort
symbol, we reduce k-1 by the number of cursor positions in
$(c_1)P$ and find the required cursor position of $(c_2)P$. In
this way the sequence (c)P can be effectively computed for all
k.


If the argument to ARBNO is monic and if ARBNO is anchored a
kind of backup-free scanning results which can be useful for
selectively scanning over portions of a string. For example,
if Q holds the quote character, then

      S  POS(0) ARBNO(Q BREAK(Q) Q | NOTANY(Q)) P

will scan S for a substring not contained in quotes which will
match the pattern P.

A reasonable exercise at this point is to demonstrate that P
is applied at all pre-cursors not within quotes. First note
that the argument to ARBNO is monic and linear. Next we need
a

Proposition  Let m be linear monic. Then

$$ \mathrm{ARBNO}(m) = \mathrm{NULL} \mid m \mid m^2 \mid m^3 \mid \cdots \qquad (6.16) $$

where $m^2$ is m concatenated with m, $m^3 = m^2 m$, etc.

Proof:

$$
\begin{aligned}
\mathrm{ARBNO}(m) &= m^* \\
  &= \mathrm{NULL} \mid m\,m^* \\
  &= \mathrm{NULL} \mid m\,(\mathrm{NULL} \mid m\,m^*) \\
  &= \mathrm{NULL} \mid m \mid m^2 m^* \qquad \text{by (6.11)}
\end{aligned}
$$

By induction it can be shown that the ith term is m to the
(i-1)st power.

Given (6.16) it should be evident that the sequence of
pre-cursors applied to P is monotonically increasing and that P
is applied at all points other than within quotes.
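
The claim about pre-cursors can be checked on a sample subject. The Python sketch below is an illustration, not the author's code: patterns are assumed to be functions from (subject, cursor) to a list of post-cursors, arbno follows (6.13) directly, and the guard that the argument must advance the cursor is an assumption added so that the recursion terminates.

```python
def lit(text):
    return lambda s, c: [c + len(text)] if s.startswith(text, c) else []

def NOTANY(chars):
    return lambda s, c: [c + 1] if c < len(s) and s[c] not in chars else []

def BREAK(chars):
    def match(s, c):
        for i in range(c, len(s)):
            if s[i] in chars:
                return [i]
        return []
    return match

def alt(p, q):
    return lambda s, c: p(s, c) + q(s, c)

def cat(p, q):
    return lambda s, c: [d for c1 in p(s, c) for d in q(s, c1)]

def arbno(p):
    # (6.13): P* = NULL | P P*
    def match(s, c):
        out = [c]
        for d in p(s, c):
            if d > c:          # assumption: p never matches the null string
                out += match(s, d)
        return out
    return match

Q = "'"
skip = alt(cat(cat(lit(Q), BREAK(Q)), lit(Q)), NOTANY(Q))
subject = "AB'CD'E"

# These are the cursors at which a following pattern P would be tried.
pre_cursors = arbno(skip)(subject, 0)
print(pre_cursors)             # [0, 1, 2, 6, 7]
```

Cursors 3, 4 and 5 lie inside the quoted string and never appear, and the sequence is increasing, as (6.16) predicts for a linear monic argument.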


As another example, PL/I comments are delimited by /* on the
left and */ on the right. To match pattern P against a string
not contained in a comment we can execute:

      S  POS(0) ARBNO('/*' FENCE ARB '*/' FENCE | LEN(1)) P
                                                       (6.17)

Even the most ardent SNOBOL4 enthusiast will admit to being
puzzled occasionally over the use of FENCE. Its double
application in this example virtually begs for analysis. First
note that any pattern of the form P FENCE | M is monic for all
patterns P and all monic patterns M. Hence the argument to
ARBNO is monic. For any pattern P we have


$$ (c)P = C\lambda $$

The associated linear pattern, PL, sometimes called the linear
part of P, is defined as

$$ (c)\,PL = C $$

The associated nonlinear pattern, PN, sometimes called the
nonlinear part of P, is defined as

$$ (c)\,PN = c\,\lambda $$

For example, the linear part of (ANY('AB') FENCE) is ANY('AB')
and, in general, the linear part of (m FENCE) for any linear
monic m is m itself. The nonlinear part is NULL | m ABORT.
The linear part of a monic pattern is monic. For example, the
linear part of ('/*' | LEN(1)) FENCE is the monic pattern that
matches '/*' if present or a single character if '/*' is not
present. Note that

$$
\begin{aligned}
(c)(PN\;PL) &= (c\,\lambda)\,PL = (c\,PL)\,\lambda \\
            &= C\lambda
\end{aligned}
$$

and hence for all patterns P

      PN PL  =  P                            (6.18)


Note too that if PN is the associated nonlinear part of some
pattern then

      FENCE PN  =  FENCE  =  PN FENCE        (6.19)

From (6.19) and (6.18) and associativity it follows that

      FENCE P  =  FENCE PL                   (6.20)

for all patterns P. In what follows, let

      F  =  FENCE
      N  =  NULL
      A  =  ABORT

As stated previously

      F  =  N | A                            (6.21)

For all patterns P, using (6.21) and right distribution

      F P  =  P | A                          (6.22)

For all P

      P A | A  =  A                          (6.23)

If M is monic, it may easily be shown using (6.23) and (6.21)
and right distribution that

      F M F  =  F M                          (6.24)


Proposition  If M is monic and if m is the linear part of M
then

      F M*  =  (F M)*  =  F (M F)*  =  F m*  (6.25)

Proof: To prove the first equality, by (6.22), (6.13), (6.22),
and (6.24)

      F M*  =  M* | A
            =  N | M M* | A
            =  N | F M M*
            =  N | F M F M*

The last equation has the general form

      P  =  N | F M P

Since (F M)* also satisfies this equation we have by (6.15)

      F M*  =  (F M)*

To prove the second equality, let $M_1$ = M F. $M_1$ is clearly
monic. By the first equality

$$ F M_1^* = (F M_1)^* $$

Replacing $M_1$ by M F and then using (6.24) we have

      F (M F)*  =  (F M F)*  =  (F M)*

To prove the third equality, use the fact that F M = F m (see
(6.20)) and the first equality to obtain

      (F M)*  =  (F m)*  =  F m*


Let us return to our example of searching for a semi-colon not
within comment delimiters. The pattern

      POS(0) ARBNO('/*' FENCE ARB '*/' FENCE | LEN(1)) P

is of the form POS(0) ARBNO(M) P where M is monic. This follows
from the fact that any pattern of the form P FENCE | M is
monic. Anchoring on the left with POS(0) is equivalent to
anchoring on the left with FENCE from the standpoint of global
scanning. By (6.25)

      FENCE ARBNO(M) P  =  FENCE ARBNO(ML) P
                        =  FENCE (NULL | ML | (ML)^2 | ...) P

where ML is the linear part of M. We need only show that ML
behaves properly. From its definition there are only 3 cases
to consider at any given cursor position.

1) The string '/*' appears at the cursor position and there
follows a '*/' in the string. In this case the entire comment
is matched by ML.

2) The string '/*' appears but no following '*/' is present.
In this case ML fails.

3) The string '/*' does not appear at the cursor in which case
a single character is matched.
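
The three cases above can be written down directly. The Python sketch below is an illustration only; the function names ml and pre_cursors are hypothetical, and ml is assumed to model the linear part ML as a monic function returning a list with at most one post-cursor.

```python
def ml(s, c):
    """Linear part of ('/*' FENCE ARB '*/' FENCE | LEN(1)), by cases."""
    if s.startswith('/*', c):
        end = s.find('*/', c + 2)
        # Case 1: a closing delimiter follows, so skip the whole comment.
        # Case 2: no closing delimiter, so fail.
        return [end + 2] if end != -1 else []
    # Case 3: no comment opener here, so match a single character.
    return [c + 1] if c < len(s) else []

def pre_cursors(s):
    """Cursors presented to P by NULL | ML | (ML)^2 | ... anchored at 0."""
    out, c = [], 0
    while True:
        out.append(c)
        nxt = ml(s, c)
        if not nxt:
            return out
        c = nxt[0]

print(pre_cursors('A/*x*/B'))   # [0, 1, 6, 7]
print(pre_cursors('A/*x'))      # [0, 1]
```

In the second call the unclosed comment stops the scan at the opening delimiter, which corresponds to case 2.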


From this it should be clear that P is applied to all cursors
in the order of increasing cursor position except within
comments or unclosed comment constructions.


RECURSIVE PATTERNS

A pattern P which is defined in terms of itself is said to be
defined recursively. In the investigation of ARBNO, we have
encountered the definition $P = Q_1 \mid Q_2 P$ where $Q_1$ and
$Q_2$ were given. Even in this simple case there were values for
$Q_1$ and $Q_2$ which would lead to an improper definition for P
even though the specific case of ARBNO led in all cases to a
valid definition. The general case of recursive definition is
of interest to the SNOBOL4 programmer because the language
permits, via unevaluated expressions, arbitrarily constructed
recursive definitions. For example, the SNOBOL4 assignment

      P  =  NULL | 'A' *P

assigns to P a pattern which will satisfy the equation

      P  =  NULL | 'A' P

From Prop. (6.15) we know that P is well-defined and has a
value according to (6.13) of ARBNO('A').


More generally, if P is assigned the value f(*P), where f is
some functional form, then the pattern so defined is the one
which satisfies the equation

      P  =  f(P)

It may be that no pattern or more than one pattern satisfies
the equation in which case P is not well-defined. The scanner
typically loops for cases that are not well-defined. In SNOBOL4
it is quite easy to write a recursive definition which has more
than one solution. For example:

      P  =  *P

has an infinite number of solutions. It is not quite so easy
to find a recursive definition such that there is no solution
to P. To do so we make up a primitive pattern function called
NOT, defined as:

$$
(c)\,\mathrm{NOT}(P) =
\begin{cases}
c           & \text{if } (c)P = \varnothing \\
\varnothing & \text{if } (c)P \ne \varnothing
\end{cases}
$$

There surely is no solution to the equation

      P  =  NOT(P)


and hence the assignment P = NOT(*P) would lead to an
ill-defined construct. NOT, however, is not a primitive
facility of SNOBOL4 and, moreover, it is not known whether a
recursive definition can be written in SNOBOL4 which does not
have at least one solution.

There are many ways in which a recursive definition can be
poorly formed in SNOBOL4 and these usually result in having
more than one possible solution. Frequently the following
principle is violated.

Proposition  Let A, B, C and D be patterns. If B does not
match the null string or a string of negative length then

      P  =  A | B P C | D                    (6.26)

has at most one solution for P.

Proof: Let $P_1$ and $P_2$ be different solutions to (6.26). Let S
be a string which is matched differently by $P_1$ and $P_2$. Let c
be the cursor in S with the largest cursor position such that
$(c)P_1 \ne (c)P_2$. Then

$$
\begin{aligned}
(cA)(cBP_1C)(cD) &\ne (cA)(cBP_2C)(cD) \\
(cBP_1C) &\ne (cBP_2C) \\
(cBP_1) &\ne (cBP_2)
\end{aligned}
$$

Then for some c' in the sequence (cB) we must have

$$ (c'P_1) \ne (c'P_2) $$

But by definition of B, c' is greater than c which contradicts
the assumption that c was greatest.

(6.26) can be strengthened a great deal (see Exer. 6.20) but
this simple statement is quite powerful. For example, let

      P  =  'B' | 'A' P                      (6.27)

Then by (6.26), P is unique. Now

      ARBNO('A') 'B'  =  (NULL | 'A' ARBNO('A')) 'B'
                      =  'B' | 'A' (ARBNO('A') 'B')

This last equation is in the form (6.27) so that

      P  =  ARBNO('A') 'B'

is the unique solution for P.
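
The uniqueness result for (6.27) can be spot-checked by building the recursive pattern and comparing it with ARBNO('A') 'B' on a few subjects. The Python sketch below is illustrative: patterns are assumed to be functions from (subject, cursor) to a list of post-cursors, and make_recursive is a hypothetical helper standing in for the deferred reference *P.

```python
def lit(text):
    return lambda s, c: [c + len(text)] if s.startswith(text, c) else []

def alt(p, q):
    return lambda s, c: p(s, c) + q(s, c)

def cat(p, q):
    return lambda s, c: [d for c1 in p(s, c) for d in q(s, c1)]

def arbno(p):
    def match(s, c):
        out = [c]
        for d in p(s, c):
            if d > c:                  # assumption: p advances the cursor
                out += match(s, d)
        return out
    return match

def make_recursive(build):
    # `self_ref` plays the role of *P: it looks up the finished pattern
    # only when it is actually called upon to match.
    def self_ref(s, c):
        return body(s, c)
    body = build(self_ref)
    return self_ref

# P = 'B' | 'A' *P
P = make_recursive(lambda p: alt(lit('B'), cat(lit('A'), p)))
R = cat(arbno(lit('A')), lit('B'))

samples = ['B', 'AAB', 'AAA', 'ABAB', '']
agree = all(P(s, c) == R(s, c) for s in samples for c in range(len(s) + 1))
print(agree)                # True
print(P('AAB', 0))          # [3]
```

Agreement on samples is of course not a proof; the proposition supplies that.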
If P is given as

      P  =  A | B P

where B can match the null string we can frequently formulate
a set of solutions for P which satisfy the equation. First we
define IF(P) as:

      IF(P)  =  NOT(NOT(P))                  (6.28)

Then note that from the definition of NOT

      NULL  =  NOT(P) | IF(P)                (6.29)

for all patterns P. It follows that for arbitrary patterns P
and Q:

      P  =  IF(Q) P | NOT(Q) P               (6.30)

In this way we can decompose P into a number of disjoint
alternatives from which we may analyze the behavior of P. Note
from this last equation, since $(c)\,\mathrm{NOT}(P)\,P = \varnothing$
for all c, we have

      P  =  IF(P) P                          (6.31)
For example, let P be 'defined' recursively as:

      P  =  LEN(1) | POS(0) P                (6.32)

By considering various disjoint situations we can reason out a
behaviour pattern for P as follows:

$$
\begin{aligned}
(c)P &= [1, 1, \ldots] && \text{if POS(0) LEN(1) would succeed} \\
(c)P &= c + 1          && \text{if NOT(POS(0)) LEN(1) would succeed} \\
(c)P &=\ ?             && \text{if POS(0) NOT(LEN(1)) would succeed} \\
(c)P &= \varnothing    && \text{if NOT(POS(0)) NOT(LEN(1)) would succeed}
\end{aligned}
$$

The question mark (?) indicates that at this set of conditions
the equation merely says that P = P and so any pattern would
do. Letting X indicate such an arbitrary pattern we have

      P  =  POS(0) LEN(1) SUCCEED | NOT(POS(0)) LEN(1) |
            POS(0) NOT(LEN(1)) X             (6.33)

We will let the reader confirm that any pattern of the form
(6.33) is a solution to (6.32), noting that NULL | SUCCEED =
SUCCEED, that $P_1 \mid P_2 = P_2 \mid P_1$ if $P_1$ is mutually
exclusive with $P_2$, and that POS(n) NOT(POS(n)) = FAIL.


Patterns exhibiting left recursion present ambiguous
conditions which are resolved when the scanner is in a mode
known as QUICKSCAN (the default mode). Consider

      P  =  P 'A' | 'B'                      (6.34)

This equation has a solution P = ABORT. As we will see,
however, in QUICKSCAN mode the pattern

      P  =  *P 'A' | 'B'                     (6.35)

operates as if it were defined as

      P  =  'BAA...' | ... | 'BAA' | 'BA' | 'B'

where this indicates that P matches any substring equal to a
'B' followed by an arbitrary number of 'A's, matching
alternates in the order of decreasing length. The reader may
easily confirm that this value for P also satisfies (6.34).




This is implemented roughly as follows. When *P is called upon
to match in (6.35) the subject is reduced (on the right) by
the minimum number of characters required by *P's subsequent
(1 character in this case). Hence recursive plunges are taken
until no more characters remain, which breaks the loop. Some
of the details of this process are described in the next
chapter. To establish the theoretical background for
understanding this heuristic, first note that if A does not
match the null string or a string of negative length, then for
any finite sequence C

$$ (C)A = C \;\Rightarrow\; C = \varnothing \qquad (6.36) $$

This is easily seen by considering the smallest cursor
position in C, from which an immediate contradiction results.


Proposition  If A does not match the null string or a string
of negative length and if both A and B are finite linear
patterns then

      P  =  P A | B                          (6.37)

has exactly one finite linear solution for P, viz.

$$ P = \cdots \mid BA^2 \mid BA \mid B \qquad (6.38) $$

Proof: We first note that (6.38) is well-defined if A must
match a nonzero length string since we can discard all
alternates other than the last L where L is the length of the
subject. Using (6.37) we obtain

$$ cP = (cPA)(cB) \qquad (6.39) $$

If $(cB) = \varnothing$ then, by (6.36), $(cP) = \varnothing$. Since
(cB) is finite linear it may, by Exer. 6.6, be removed from
both sides of (6.39). Letting $C_1$ be the result of this removal
from cP we have

$$ C_1 = cPA = (C_1\,(cB))A = (C_1 A)(cBA) $$

Again, by (6.36), if $cBA = \varnothing$ we have that
$C_1 = \varnothing$. Otherwise we may remove cBA from both sides.
Assume that $C_2$ is what remains after removing cBA from $C_1$.
Then, as before

$$ C_2 = (C_2 A)(cBA^2) $$

This process eventually terminates with $C_n = \varnothing$ and
this is ensured by the fact that A does not match the null
string. Hence we have

$$ cP = \cdots (cBA^3)(cBA^2)(cBA)(cB) $$

from which we obtain (6.38). We conclude that the QUICKSCAN
heuristic limits the solution space of (6.37) to finite linear
solutions. On the other hand under FULLSCAN, (6.37) loops,
implying no such restriction on the solution space.
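
The construction in the proof can be followed mechanically. The Python sketch below is an illustration and does not reproduce the QUICKSCAN implementation: it assumes finite linear patterns are functions from (subject, cursor) to a list of post-cursors, and the function name left_recursive_solution is hypothetical. It builds the terms (cB), (cBA), (cBA^2), ... until one is empty and then emits them in the order given by (6.38).

```python
def lit(text):
    return lambda s, c: [c + len(text)] if s.startswith(text, c) else []

def apply_seq(p, s, cursors):
    # (6.2): [c1, c2, ...] P = (c1 P)(c2 P)...
    return [d for c in cursors for d in p(s, c)]

def left_recursive_solution(a, b, s, c):
    """cP = ... (cBA^2)(cBA)(cB) for P = P A | B, with A never null."""
    terms = []
    seq = b(s, c)                    # (c B)
    while seq:
        terms.append(seq)
        seq = apply_seq(a, s, seq)   # (c B A^k) for the next k
    # Longest alternatives come first.
    return [d for term in reversed(terms) for d in term]

A, B = lit('A'), lit('B')
subject = 'BAAA'
cP = left_recursive_solution(A, B, subject, 0)
print(cP)                                          # [4, 3, 2, 1]

# The result satisfies (6.39): cP = (cPA)(cB).
print(cP == apply_seq(A, subject, cP) + B(subject, 0))   # True
```

The loop terminates because every application of A moves each cursor to the right and the subject is finite.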
PPPVIPZIPPPVIPIZPIPIPPPFZ_-CEXERCISES 22727222222222222222227? 
PPPPPPPPPPPPPPPPPPPPPPPPPPIPPPPPPPPPPPPPPPPPPPPPPPPMPIPVIZIIPP 
er ee ee ee 
| Exercise 6.1 | Which of the following are true? 
[ ee ee | ‘ 

a) ‘At = ‘At y tar 

b) ‘At { "Bt = ANY('AB*) 

c) ARBNO('A') = NUL {| ARBNO('A') 

d) BREAK(S) ANY(S) = ARB ANY(S) 

e) far a | ‘Be = Be { tar. 

£) ANY (‘ABC’) = NOTANY (DIFF (6ALPHABET, 'ABC*) ) 

g) FENCE (P, { Po) = FENCE Py { FENCE Pz 

h) (‘AB' | "DEF') ('Gt | 'H*) = 

‘ABG! | "ABH | "DEFG*S { ‘'DEFH! 

i) ARB = ARBNO(LEN(1)) 

j) (P1 | Pz) FENCE = P, FENCE {| Pz FENCE 

Exercise 6.2   While pattern alternation is defined as 


(c) (P1 | P2) = (c)P1 (c)P2 


it is not in general true that 

(C) (P1 | P2) = (C)P1 (C)P2 


where C is a sequence of cursor positions. Find a counter- 
example. 


Exercise 6.3   Reduce the following pattern to canonical 
form: 


('B' | 'R') ('E' | 'FA') ('P' | 'DS') 


Is the pattern monic? 


Exercise 6.4   In semigroup terminology a left zero z is 
defined as an element such that ze = z for 
all elements e of a semigroup. What is a left zero for a) the 
semigroup of patterns with the alternation operator, b) the 
semigroup of patterns with the concatenation operator, and c) 
the semigroup of linear but possibly infinite cursor sequences 
under concatenation? 


Exercise 6.5   An idempotent element E for an operator * 
has the property that 


E * E = E 


Which of the following are idempotent under concatenation? 


a) BREAK(S)      f) NULL 

b) SPAN(S)       g) FENCE 

c) TAB(N)        h) ABORT 

d) POS(N)        i) 'A' 

e) FAIL          j) ARB 

Exercise 6.6   Let E1 and E2 be extended sequences and C a 
finite linear sequence. Show that any C is 
left and right cancellative, where left cancellative is 
defined by a) and right cancellative is defined by b). 


a) C E1 = C E2  =>  E1 = E2 
b) E1 C = E2 C  =>  E1 = E2 


Show that arbitrary E are not cancellative by finding an E, E1 
and E2 such that 


c) E E1 = E E2  but E1 ≠ E2 
d) E1 E = E2 E  but E1 ≠ E2 


Demonstrate that if pattern R is finite, linear, then for any 
two patterns P1 and P2 


e) R | P1 = R | P2  =>  P1 = P2 
f) P1 | R = P2 | R  =>  P1 = P2 


Exercise 6.7   What are the first five alternands in the 
expression: 


ARBNO(ARBNO(LEN(1))) 


Exercise 6.8   Show that if M is monic and P is merely any 
pattern, then 


P FENCE | M 

is monic. 

Exercise 6.9   Let P = ARB ARB. Let L be the length of the 
subject. How many post-cursor positions are 
there in (0) P? 

Exercise 6.10   Show that the pattern matching statement 


Subject POS(0) Pattern 


is equivalent to the statement 


Subject FENCE Pattern 

Exercise 6.11   Let 

P = ARBNO(LEN(1) ARB) 


How many post-cursor positions are there in (0)P where the 
size of the subject is L characters? 


Exercise 6.12   Prove that if m is linear monic then 
m(P1 | P2) = mP1 | mP2. 


Exercise 6.13   Which of the following patterns are neces- 
sarily monic? 


a) BREAK('ABC')          e) P | ABORT 
b) POS(0) | RPOS(0)      f) FENCE P 
c) ANY(S) | BREAK(S)     g) P FENCE 
d) POS(N) | TAB(N)       h) FENCE | FENCE 


Exercise 6.14   Augment the pattern shown in (6.17) to skip 
over material in quotes ('...') as well as 
within comments. Make sure that characters within unclosed 
quotes are also passed over. 


Exercise 6.15   Let P = ARBNO('A' ARB 'B'). What is the 
sequence of post-cursor positions for 

a) ('AB',0)P ?        b) ('ABAB',0)P ? 


c) How many post-cursors are there in (DUPL('AB',K),0)P ? 

Exercise 6.16   Using the technique of Exercise 6.14, write 
a pattern which will scan for a PL/I state- 
ment failing if none exists. 


Exercise 6.17   Furnish a counter-example to the following 


ARBNO(P) = NULL | P | P^2 | P^3 | ... 


Exercise 6.18   Using back-up-free scanning, write a pat- 
tern which will print out all SNOBOL4 iden- 
tifiers in a string of SNOBOL4 source. Identifiers within 
quotes should not be printed. It will be OK to print out the 
S and F of GOTO's. For example 


ALPHA = 'ABC' B('X') :S(SAM) 


should print the strings 'ALPHA', 'B', 'S' and 'SAM'. 


Exercise 6.19   Let PL1 and PL2 be the associated linear 
patterns of P1 and P2, respectively. Provide 
a counter-example to the conjecture that PL1 | PL2 is the as- 
sociated linear pattern of P1 | P2. 


Exercise 6.20   Let f(P) be an expression involving P com- 
posed of constant patterns, alternation and 
concatenation. Show that f(P) can be written as 


A0 | B1 P f1(P) | A1 | B2 P f2(P) | A2 ... 


where A0, A1, A2, ..., An, B1, B2, ..., Bn are patterns not in- 
volving P and f1, f2, ..., fn are functions. From this, show 
that if B1, B2, ..., Bn do not match the null string and if no 
pattern primitive matches a string of negative length, then 


P = f(P) 

has at most one value for P. 

Exercise 6.21   Which of the following equations for P 
uniquely specify a pattern? If P is unique, 
give its value. Otherwise indicate a class of values (via X) 
which will satisfy it. 


a) P = RPOS(0) | BREAK(S) P 
b) P = ANY(S) | SPAN(S) P 
c) P = ANY(S) | BREAK(S) P 
d) P = TAB(N) | POS(N) P 
e) P = TAB(N) | RTAB(N) P 

Exercise 6.22   Let P be a pattern not matching the null 
string. Define P- recursively as 

P- = P P- | NULL 


Show that P- is well defined. P- is called the negative ARBNO 
of P. 


Let P be given as 

P = X | Y P | Z 


where Y is monic and does not match the null string. Write P 
explicitly in terms of X, Y, Z and the two ARBNO's. 
Unevaluated Expressions 


While it is not strictly necessary to know how pattern 
matching is implemented in order to use SNOBOL4 patterns, 
it is necessary to be somewhat aware of the 
implementation in order to program efficiently and 
well. This chapter is based on the internals of three 
independent SNOBOL4 implementations, MAINBOL, SPITBOL, and 
SITBOL. 


The compiler processes all statements in a uniform manner 
without treating the pattern-matching statement any dif- 
ferently (essentially) than any other statement. Every state- 
ment is compiled into a kind of Polish notation which may be 
visualized as a tree. For example the pattern 


('A' BREAK('XY') | 'D') (ANY('ABC') | 'HA' | 'TA') 


is depicted in Figure 7.1. An empty box denotes concatenation 
and the compiler treats | as associating to the left. 


                  [ ] 
                /     \ 
              |         | 
             / \       / \ 
          [ ]   'D'   |   'TA' 
          / \        / \ 
       'A'  BREAK  ANY  'HA' 
              |     | 
            'XY'  'ABC' 

Figure 7.1 


The compiled form of 
('A' BREAK('XY') | 'D') (ANY('ABC') | 'HA' | 'TA') 

Pattern matching operates by the concerted action of a set of 
built-in monic patterns called primitives. Strings used as 
patterns, and the patterns indicated by BREAK and ANY, fall 
into this category. Abstracting Figure 7.1 to the point of 
representing all primitives by single letters we arrive at the 
diagram in Figure 7.2. 


                [ ] 
              /     \ 
            |         | 
           / \       / \ 
        [ ]   C     |   F 
        / \        / \ 
       A   B      D   E 

Figure 7.2 
The abstract tree of the expression: 
('A' BREAK('XY') | 'D') (ANY('ABC') | 'HA' | 'TA') 


This form or structure for the pattern is, however, not the 
most suitable for doing pattern matching. In Figure 7.2 if 
nodes A and B match successfully, node D is then attempted. 
But to obtain D the scanner must go up the tree to the top 
node and back down on the right hand side to find the primi- 
tive which is to be matched next. Since ancestor information 
is not present explicitly in the compiled Polish prefix this 
tree walking would be prohibitively expensive. A similar thing 
can be said about the events which occur when a primitive 
fails. The information available from the tree, while com- 
plete, does not seem to be in a form most conducive to rapid 
search. Hence, when the expression represented by the Polish 
tree is evaluated, an entirely new structure is created. An 
example of such a structure is shown in Figure 7.3. A solid 
arrow drawn from a node X to a node Y indicates that if X is 
successful Y will be matched next. Y is called the subsequent 
of X. A dotted arrow from X to Y indicates that, if X fails, 
Y can be tried immediately with the same pre-cursor position. 
Y is then called the alternate of X. 


PATH DIAGRAMS   More formally, a path diagram is an 
interconnection of nodes. Each node may 
have a subsequent (indicated by a solid arrow) or an 
alternate (indicated by a dotted arrow) or both. 
Each node has an associated primitive which is a 
monic pattern. An s-vacancy is a node without a 
subsequent. An a-vacancy is a node without an alternate. The 
root of a path diagram is the node with no arcs directed into 
it. (It is easy to show that construction limits the number 
of root nodes to one.) 
Figure 7.3 


The path diagram associated with Figure 7.2. 


The path diagram of a pattern consisting only of a primitive p 
is simply a node without subsequent and without alternate and 
with p as its associated primitive. The concatenation of two 
path diagrams D1 D2 is found by drawing a solid arrow from 
every s-vacancy of D1 to the root of D2. The alternation of 
two path diagrams D1 | D2 is obtained as follows: starting 
with the root of D1, search down the chain of alternates until 
an a-vacancy is found. Then draw a dotted arrow from this a- 
vacancy to the root of D2. 


It is interesting to note that the operations of alternation 
and concatenation of path diagrams are (like patterns) as- 
sociative. Hence path diagrams form a semigroup under these 
two operations. 


The pattern node contains four essential fields as indicated 
below (one more field is introduced later). 


     PROG   program address 
     SUBS   subsequent 
     ALT    alternate 
     ARG    argument 


To describe the pattern matching algorithms in SNOBOL4 we 
would declare a structure of type NODE as 


         DATA('NODE(PROG,SUBS,ALT,ARG)') 
Then, to allocate a node for, say, LEN(13), we may execute 
         NODE('LENP',,,13) 
where the label 'LENP' indicates the location which handles 
the LEN primitive. Its encoding would be the machine language 
counterpart of the following SNOBOL4 statements. 

*  Is the number of characters remaining in the SUBJECT >= 
*  ARG(NODE)? If not, fail! 
LENP     GE(SIZE(SUBJECT) - CURSOR, ARG(NODE))           :F(F) 

*  Otherwise compute the post-cursor position and succeed. 
         CURSOR = CURSOR + ARG(NODE)                     :(S) 


Here F is a label in the scanner where all primitives go to 
upon encountering failure and S is the label they go to when 
they encounter success. Note that the primitive bumps the 
CURSOR. 


One may suppose that a routine to concatenate two path 
diagrams can be written in SNOBOL4 very easily. Consider the 
following attempt. 


         DEFINE('CONCAT(P1,P2)')                         :(CONCAT_END) 
*  If P1 is null, just fail! 
CONCAT   IDENT(P1,NULL)                                  :S(FRETURN) 
*  Otherwise fill up the S-vacancies of the alternate and 
*  subsequent. 
         CONCAT(ALT(P1),P2) 
         CONCAT(SUBS(P1),P2)                             :S(RETURN) 
*  Failure to CONCAT implies that the subsequent was null. 
*  Plug it! 
         SUBS(P1) = P2                                   :(RETURN) 
CONCAT_END 


The above routine is not valid for several reasons. 1. Path 
diagrams, as we will see later, can have loops and this will 
possibly ensnare CONCAT in a recursive loop. 2. If the two 
arguments, P1 and P2, are identical the result is an abomina- 
tion. 3. The algorithm modifies P1, the first pattern. This 
is only permissible if it is known that P1 is not to be used 
for any other purpose. This guarantee, of course, does not 
exist. 


All three problems can be surmounted by copying the first pat- 
tern. Copying a graph with loops was treated earlier (COPYL, 
Prog. 5.8) and that function can be modified to perform the 
concatenation. See Exercise 7.4. A similar situation prevails 
with respect to alternation. 
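
The copy-then-concatenate idea can be sketched in Python as follows. This is only an illustration under stated assumptions: the class Node and the helpers copy_diagram, nodes_of and concat are hypothetical names, and the copy is a generic memoized graph copy rather than the book's COPYL.

```python
class Node:
    """A path-diagram node: a primitive plus optional subsequent/alternate."""
    def __init__(self, prim, subs=None, alt=None):
        self.prim, self.subs, self.alt = prim, subs, alt

def copy_diagram(node, seen=None):
    """Copy every node reachable from `node`; `seen` makes loops safe."""
    if node is None:
        return None
    seen = {} if seen is None else seen
    if id(node) in seen:
        return seen[id(node)]
    clone = Node(node.prim)
    seen[id(node)] = clone            # register before following any arc
    clone.subs = copy_diagram(node.subs, seen)
    clone.alt = copy_diagram(node.alt, seen)
    return clone

def nodes_of(root):
    """All nodes reachable from root, each listed once."""
    found, todo = {}, [root]
    while todo:
        n = todo.pop()
        if n is not None and id(n) not in found:
            found[id(n)] = n
            todo += [n.subs, n.alt]
    return list(found.values())

def concat(p1, p2):
    """Return a new diagram for p1 p2, leaving p1 itself untouched."""
    root = copy_diagram(p1)
    for n in nodes_of(root):          # collected before any arc is added
        if n.subs is None:            # an s-vacancy of the copy
            n.subs = p2
    return root

# Concatenating a diagram with itself, the case that breaks CONCAT.
p = Node("'A'", alt=Node("'B'"))
pp = concat(p, p)
```

Because the s-vacancies are collected from the copy before any arrow is drawn, loops cannot trap the traversal, identical arguments are harmless, and the first pattern is left unmodified.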


A much more practical method, and one that is used by most im- 
plementors, is to group all the pattern nodes together in one 
contiguous block. This not only facilitates the copy operation 
but increases the speed of sequencing through the nodes of a 
pattern. (Exercise 7.6 explores this possibility.) Logically, 
however, it is correct to think of the pattern as being an 
inter-linked collection of nodes. 


DERIVED PATTERNS   Can a pattern be reconstructed from 
the path diagram? The answer is yes. 
Let p(n) be the primitive associated with node n. 
The derived pattern of node n, D(n), is defined in 
terms of its associated primitive and the derived 
patterns of its subsequent node s and its alternate 
a as follows: 


D(n) = p(n) D(s) | D(a)    if a and s exist 
     = p(n) D(s)           if only s exists 
     = p(n) | D(a)         if only a exists 
     = p(n)                if neither a nor s exists 


The derived pattern of a path diagram is defined as the 
derived pattern of its root. 
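
The four cases of D(n) translate directly into a short recursive function. The Python sketch below is an illustration only: Node and derived are hypothetical names, the diagram is assumed to be free of loops, and the wiring of the example follows the construction rules for concatenation and alternation of path diagrams applied to the pattern of Figure 7.1 (it is an inference, since Figure 7.3 itself is not reproduced here).

```python
class Node:
    """A path-diagram node holding the text of its primitive."""
    def __init__(self, prim, subs=None, alt=None):
        self.prim, self.subs, self.alt = prim, subs, alt

def derived(n):
    """Return D(n) as text; assumes the diagram has no loops."""
    text = n.prim
    if n.subs is not None:
        d = derived(n.subs)
        # Parenthesize D(s) when it is itself an alternation, because
        # concatenation binds more tightly than alternation.
        text += " " + ("(" + d + ")" if n.subs.alt is not None else d)
    if n.alt is not None:
        text += " | " + derived(n.alt)
    return text

# ('A' BREAK('XY') | 'D') (ANY('ABC') | 'HA' | 'TA')
ta = Node("'TA'")
ha = Node("'HA'", alt=ta)
any_ = Node("ANY('ABC')", alt=ha)
d = Node("'D'", subs=any_)
brk = Node("BREAK('XY')", subs=any_)
a = Node("'A'", subs=brk, alt=d)

print(derived(a))
```

The printed pattern is the original with the concatenation distributed over the first alternation, which matches the same strings in the same order.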


When the scanner is defined, it will be seen that it imple- 
ments the derived pattern. Also, it can be shown [Gimpel, 
1971] that any pattern will equal the derived pattern of its 
path diagram. Together these two observations constitute a 
proof of the pattern matching algorithm and provide a 
theoretical basis for the extensions which follow. 


Program 7.1   The algorithm used internally to do pattern 
matching is illustrated by the function 
SCAN. SCAN has two arguments, the LENGTH of 
the subject and a pattern identified by its 
root node NODE. The subject itself is held by a global 
variable SUBJECT and the current cursor value is held in a 
global variable CURSOR. There are good reasons for the selec- 
tion of which quantities are to be passed to SCAN and which 
quantities are global. These reasons will be evident when 
Unevaluated Expressions are discussed. 


The initial value of CURSOR is set by a driver program called 
MATCH (Exercise 7.8). In unanchored mode, if SCAN fails, MATCH 
increments this pre-cursor by 1 and calls SCAN again. The al- 
gorithm requires a stack and the familiar operations of PUSH 
and POP. The driver program initializes things by pushing a 
null alternate and a pre-cursor value. 


*  Basic SCAN function. The pattern identified by its root 
*  node NODE is matched against the SUBJECT at a pre-cursor 
*  position given by the global variable CURSOR. CURSOR is 
*  updated on success. The stack is another global quantity 
*  which SCAN modifies as a side-effect. If it fails, the 
*  start-up alternate-cursor pair are popped. On success, a 
*  sequence of alternates may remain on the stack. 
         DEFINE('SCAN(LENGTH,NODE)') 
         DATA('NODE(PROG,ALT,SUBS,ARG)')                 :(SCAN_END) 

*  Entry point and top of loop: If an alternate to the cur- 
*  rent node exists, push the alternate and the current 
*  cursor. 
SCAN     (DIFFER(ALT(NODE)) PUSH(ALT(NODE)) PUSH(CURSOR)) 

*  Go to the program label associated with the current node. 
*  Return arrives at either S or F. 
                                                         :($PROG(NODE)) 

*  Here on success. Set NODE to the subsequent. If there is 
*  none, we are done; report success. Otherwise go back to 
*  SCAN. 
S        NODE = SUBS(NODE) 
         IDENT(NODE,NULL)                                :S(RETURN)F(SCAN) 

*  Here on failure. Pop the stack for an alternate. If null, 
*  fail. Otherwise attempt to SCAN at this node. 
F        CURSOR = POP() ; NODE = POP() 
         IDENT(NODE)                                     :S(FRETURN)F(SCAN) 
SCAN_END 

Names referenced   Name   Type       Where defined 
by SCAN:           PUSH   Function   Program 5.5 
                   POP    Function   Program 5.6 
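
For readers who want to trace the control flow, here is a minimal Python model of SCAN and of an unanchored driver in the role of MATCH. It is a sketch under assumptions: primitives are Python functions returning True or False instead of branching to S and F, the subject length is taken from the global subject rather than passed in, and the names Node, str_p, len_p, scan and match are illustrative.

```python
SUBJECT = ""      # global subject, as in the text
CURSOR = 0        # global cursor
STACK = []        # global stack holding alternate/cursor pairs

class Node:
    def __init__(self, prog, arg, subs=None, alt=None):
        self.prog, self.arg, self.subs, self.alt = prog, arg, subs, alt

def str_p(node):
    """String primitive: succeed if ARG occurs at CURSOR, bumping CURSOR."""
    global CURSOR
    if SUBJECT.startswith(node.arg, CURSOR):
        CURSOR += len(node.arg)
        return True
    return False

def len_p(node):
    """LEN primitive, mirroring the LENP statements."""
    global CURSOR
    if len(SUBJECT) - CURSOR >= node.arg:
        CURSOR += node.arg
        return True
    return False

def scan(node):
    """Match the diagram rooted at node, starting at the global CURSOR."""
    global CURSOR
    while True:
        if node.alt is not None:          # remember the alternate
            STACK.append(node.alt)
            STACK.append(CURSOR)
        if node.prog(node):               # success: go to the subsequent
            node = node.subs
            if node is None:
                return True
        else:                             # failure: pop an alternate
            CURSOR = STACK.pop()
            node = STACK.pop()
            if node is None:
                return False

def match(subject, root):
    """Unanchored driver: return (pre-cursor, post-cursor) or None."""
    global SUBJECT, CURSOR
    SUBJECT = subject
    for start in range(len(subject) + 1):
        STACK.clear()
        STACK.append(None)                # null alternate ...
        STACK.append(start)               # ... and the pre-cursor
        CURSOR = start
        if scan(root):
            return start, CURSOR
    return None

# ('A' | 'AB') 'C' as a path diagram
c = Node(str_p, "C")
ab = Node(str_p, "AB", subs=c)
a = Node(str_p, "A", subs=c, alt=ab)
print(match("xABC", a))   # (1, 4)
```

In the example the first alternative 'A' succeeds at cursor 1 but 'C' then fails, so the scanner pops the alternate 'AB' together with the saved cursor and succeeds on the second try.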


HEURISTICS   Each implementation contains a certain 
number of so-called pattern matching 
heuristics which are intended to increase the speed 
of matching while having minimal effects upon the 
success or failure of the match. Generally they fall 
into two categories, those which speed up matching 
without affecting the overall outcome of the match (termed 
unobtrusive) and those which may have some effect on the out- 
come of the match (obtrusive heuristics). The programmer may 
turn off all heuristics by setting &FULLSCAN = 1 in which case 
he is said to be matching in FULLSCAN mode. Otherwise he is 
operating in QUICKSCAN mode. At this writing he cannot selec- 
tively turn off individual heuristics or, for example, choose 
the unobtrusive but suppress the obtrusive heuristics. There 
are four heuristics: futility, length-checking, start-up and 
recursive reduction. None of these heuristics are in- 
trinsically obtrusive but under certain assumptions they may 
indeed become obtrusive. There is a fifth heuristic which is 
a protection heuristic as opposed to a speed heuristic. Its 
purpose is to catch programming errors. The pattern 
ARBNO(NULL) will loop forever in FULLSCAN mode. In QUICKSCAN 
mode, the scanner checks the number of characters matched by 
the argument to ARBNO and terminates if 0 characters were mat- 
ched. Some implementations have not included this heuristic 
and its inclusion in a language which permits arbitrary state- 
ment looping seems questionable. We will not consider it 
further. 


Futility - Under FULLSCAN the driver program successively 
calls SCAN for all cursor values with the given subject in the 
order of increasing cursor position. But such a procedure can 
be woefully time-consuming as in the following common example. 

         S BREAK(';') . K 

which causes string S to be scanned for a semicolon and, if 
found, assigns the initial substring to K. Under FULLSCAN, a 
failure at CURSOR = 0 will cause a repeat at CURSOR = 1 which 
will necessarily also result in failure, etc. A total of L + 1 
scans will be made where L is the length of the string. The 
wary user can anchor the scan either by prefixing a POS(0) to 
the pattern or by using &ANCHOR mode. However under QUICKSCAN 
mode, the futility heuristic will cause an abrupt halt of 
scanning after the first failure. 


A pattern is said to be futile for a certain cursor c if it 
fails at this and all advances of the cursor position. That 
is, if 


(c')P = ∅ for all c' ≥ c 


then P is futile for cursor c. If BREAK(S) fails at cursor c 
it is also futile at cursor c. Hence, in the above example, 
additional scanning at advanced cursor positions is not 
needed. But it is not always possible to make a simple test 
to determine the futility of a pattern. If the pattern is the 
string 'XXX' and the subject is 'ABCDE' the pattern is futile 
for any cursor position but normally this is not discovered 
until after at least 3 attempts are made to match 'XXX'. 
Hence, string patterns report futility only when there is 
insufficient length in the subject string. This is termed 
length failure. For convenience, whenever a primitive detects 
futility, it is said to experience length failure, or simply, 
to length fail. Thus, when BREAK fails, it reports length 
failure even though, strictly speaking, the futility is not 
due to an insufficient number of characters. 


If a pattern primitive detects that it is futile, it branches 
to a length-failure exit (LF). Otherwise it branches to match- 
failure (MF). Both of these are in lieu of the single fail 
location (F) in the function SCAN. Most pattern primitives 
can transmit futility detected by a subsequent. This means 
that if p2 is the subsequent of p1, and if p2 reports length 
failure, p1 can also report length failure. More formally, 
the primitive p is called a transmitter if, whenever any pat- 
tern P is futile at cursor c, and if (c')p = c, then (p P) is 
futile at c'. 
A necessary and sufficient condition that a monic pattern p be 
a transmitter is that p be monotonic in the sense that any 
increase in pre-cursor position brings about a non-decrease in 
post-cursor position. Virtually all primitives in SNOBOL4 are 
monotonic. Hence the scanner makes the assumption that all 
primitives are transmitters. Under the transmitter assumption, 
if all local failures are length-failures then the overall 
pattern is futile. 


For example, let 


Subject: 'ABC ... D' 
Pattern: 'ABC' BREAK('D') 'DE' 


Then the 'DE' when matched against the 'D' will length-fail 
indicating futility. BREAK('D') is a transmitter since its 
post-cursor position cannot possibly back-up if its pre-cursor 
advances. Hence (BREAK('D') 'DE') is futile. By a similar 
line of reasoning, 'ABC' is also a transmitter and hence the 
entire pattern is futile. The initial cursor position, 
therefore, need not be advanced beyond 0. 


The futility heuristic is implemented by a global flag which 
is set on at the start of a scan and is turned off at any 
match-fail or if a non-transmitter succeeds. The flag is 
called the futility flag. If the futility flag is on when the 
overall pattern fails, it is useless to go on. The overall 
pattern is futile. 
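
The effect of the futility flag on the number of scans can be seen in a small Python model. This is a deliberately simplified sketch: it handles only a concatenation of transmitter primitives (no alternates), each primitive returns one of the statuses "S", "MF" or "LF" together with a cursor, and the names string_p, break_p and count_scans are hypothetical.

```python
def string_p(s):
    """String primitive: length-fails only when too few characters remain."""
    def prim(subject, cursor):
        if len(subject) - cursor < len(s):
            return "LF", cursor
        if subject.startswith(s, cursor):
            return "S", cursor + len(s)
        return "MF", cursor
    return prim

def break_p(chars):
    """BREAK primitive: a failure is futile at every later cursor too."""
    def prim(subject, cursor):
        for i in range(cursor, len(subject)):
            if subject[i] in chars:
                return "S", i
        return "LF", cursor
    return prim

def count_scans(subject, prims, quickscan=True):
    """Return (matched, number of scans started by the driver)."""
    scans = 0
    for start in range(len(subject) + 1):
        scans += 1
        futility = True                   # flag set on at the start of a scan
        cursor = start
        for prim in prims:
            status, cursor = prim(subject, cursor)
            if status == "MF":
                futility = False          # a match-fail turns the flag off
            if status != "S":
                break
        else:
            return True, scans            # every primitive succeeded
        if quickscan and futility:
            break                         # the overall pattern is futile
    return False, scans

pattern = [string_p("ABC"), break_p("D"), string_p("DE")]
print(count_scans("ABCxxxxD", pattern))                   # (False, 1)
print(count_scans("ABCxxxxD", pattern, quickscan=False))  # (False, 9)
```

With the flag honored, the failing match stops after a single scan; with it ignored, the driver makes L + 1 = 9 scans on the 8-character subject before giving up.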


The futility heuristic is unobtrusive for patterns which are 
nonvarying. For varying patterns the heuristic becomes obtru- 
sive. For example, the pattern matching statement 


         'ABXB' ANY('AB') $ C BREAK(*C) 


will first assign 'A' to C and the pattern BREAK(*C) will 
fail. BREAK signals length failure and the scanner erroneously 
concludes that the entire pattern is futile. Should the pat- 
tern be matched with a pre-cursor of 1, C would be assigned 
the character 'B' and the subsequent BREAK would succeed. 
Hence the pattern was not futile. The difficulty stems from 
the fact that BREAK lied. If its argument is indeed an 
unevaluated expression, it should not signal length failure 
unless there are no characters left in the string. 


ARB is a pattern which can use the futility heuristic in two 
ways to hasten scanning. If the subsequent to ARB is futile 
at any given cursor then ARB need not extend. Moreover, (ARB 
P) where P is the subsequent will be futile. For example: 


Subject: 'AXXXBXXX' 
Pattern: 'A' ARB 'B' ARB 'C' 


In the above, the 'A' will be matched against the first 
character. ARB will match 0, then 1, 2, and 3 characters until 
'B' succeeds. The second ARB will match 0, 1, 2 characters 
until 'C' is futile. Hence, ARB 'C' is detected as being 
futile at position 5 and ARB 'B' ARB 'C' is detected as futile 
at position 1. The scanner can halt immediately. The futility 
heuristic for ARB is implemented by pushing the original state 
of the futility flag onto the stack. When the subsequent to 
ARB signals futility ARB restores the state of the futility 
flag and takes the length-fail exit. If ARB receives no in- 
dication of futility for all post-cursor positions up to and 
including L, the length of the subject, then ARB should in- 
dicate match failure. 


Start-up Heuristic - The start-up heuristic permits a pattern 
beginning with POS(n) to be applied immediately at CURSOR = 
n. The effect is an anchored mode except that the anchoring 
is done at a position other than CURSOR = 0. Both SPITBOL and 
SITBOL use this heuristic and SITBOL also uses a similar 
heuristic for patterns beginning with RPOS(n). Another start- 
up heuristic exclusive to SITBOL is so-called contextual 
anchoring. Many patterns will only match substrings beginning 
with certain letters. For example SPAN('ABC') can only match 
a substring starting with one of these 3 letters. The pattern 
'CAT' | 'DOG' will match only a string beginning with 'C' or 
'D'. Rather than call SCAN at each cursor position, it is 
faster if the driver program makes a rapid pre-scan (at BREAK 
speeds) to a point where a pattern would find a letter that it 
could possibly begin matching. Failure at the first contextual 
anchor point implies a repeated attempt to scan for the next 
contextual anchor point. The alternation of two patterns 
which are both contextually anchored is also contextually 
anchored by the union of the anchoring sets. The concatenation 
of two patterns is always anchored by the anchoring, if any, 
of the left-most pattern. The start-up heuristics in all their 
variations are unobtrusive. 


Length Checking - This check operates as follows. In the 
course of building a pattern the pattern builder deduces a 
minimum length for each node. During a match, if the number 
of characters remaining in the subject is below this number, 
then the node can immediately signal length-failure. The dif- 
ficulty with this technique is that it takes time to make this 
test and it effectively duplicates another test made concur- 
rently, the futility check. For example suppose the pattern 
is the string 'ABC'. Suppose the subject is '1234567'. The 
minimum length required by the pattern is 3. The length check 
is made 6 times. The first 5 times indicate that there is 
sufficient room in the subject. The last time a check is made, 
the length fail exit is given. However if the primitive were 
given control it would also have length failed so that the 
test is redundant. Moreover the primitive could have deduced 
that after the 5th time it was futile. If it signals length 
failure when there are 3 characters remaining (which it should 
ideally do) then the minimum length check never gets a chance 
to signal length failure. All of its activity went to increase 
the time of scanning. The length test came historically before 
the futility heuristic and its retention is probably for that 
reason. 




Length-checking would not be obtrusive if it were not for the 
so-called one-character assumption. Any unevaluated expression 
is assumed to match at least one character. For example 


         (LEN(1) $ X) (LEN(1) $ Y) *LGT(X,Y) 


will look for two characters out of order in a string. Unfor- 
tunately, if the two characters are the last two of the 
string, it will not find them because the predicate is assumed 
(erroneously) to consume one character. This is perhaps the 
most obtrusive heuristic of them all since predicates 
within a pattern are extremely common and would be 
even more so if it were not for this heuristic. The length- 
test heuristic appears only in SPITBOL and MAINBOL. SITBOL 
and FASTBOL avoid this test for the reasons indicated. 


Recursive Reduction - This refers to the scheme whereby 
SNOBOL4 is able to break left-recursive loops as in the 
pattern: 


         P = *P 'A' | 'B' 


We will defer a discussion of this heuristic until after the 
implementation of recursive patterns is considered. 


COMPOUNDS   Some built-in patterns are not implemented 
by a single node, either because they are 
not monic or because it is more efficient to imple- 
ment them as several nodes rather than one node. 
These patterns are predefined by a path diagram of 
two or more nodes and are called compounds. Examples 
of compounds are the patterns with implicit alternatives such 
as ARB, BAL, and ARBNO(p). 


ARB 
A pattern which does nothing but succeed is called nil. The 
node for nil is shown below 

     PROG   S 
     SUBS   subsequent 
     ALT    alternate 
     ARG 

where S refers to that label in the scanner to which control 
is passed in the event of a successful match. Since the primi- 
tive is effectively short-circuited, this is the fastest 
possible successful pattern. The null string may be coded as 
the nil node (it is not normally). There is no argument for 
nil. 
ARB can be thought of as being recursively defined as 

ARB = NULL | (LEN(1) ARB) 


and this leads to the compound shown in Figure 7.4. Here, 'a' 
denotes the alternate to ARB and 's' denotes its subsequent. 


[Figure 7.4: path diagram made of two nil nodes and a LEN(1) 
node, with external alternate 'a' and subsequent 's'.] 

Figure 7.4 


A compound for ARB. 
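
The recursive definition ARB = NULL | (LEN(1) ARB) fixes the order in which ARB offers its post-cursor positions: shortest first. A minimal Python sketch, in which the generator arb and the helper arb_then are illustrative names rather than anything from an implementation:

```python
def arb(subject, cursor):
    """Post-cursor positions of ARB = NULL | (LEN(1) ARB), in order."""
    yield cursor                          # the NULL alternative
    if cursor < len(subject):             # LEN(1) succeeds, then ARB again
        yield from arb(subject, cursor + 1)

def arb_then(subject, cursor, literal):
    """Match ARB followed by a literal; return the post-cursor or None."""
    for c in arb(subject, cursor):
        if subject.startswith(literal, c):
            return c + len(literal)
    return None

print(list(arb("ABCD", 1)))          # [1, 2, 3, 4]
print(arb_then("AXXXB", 1, "B"))     # 5
```

Each value produced by the generator corresponds to one extension of ARB after its subsequent has failed.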


Figure 7.4, though conceptually simple, is not the most ef- 
ficient form of ARB. The futility heuristic as applied to ARB 
needs to be implemented (see Futility) and more scanner ac- 
tivity can be incorporated within the ARB compound with a 
consequent gain in efficiency. The more efficient ARB realiza- 
tion is shown in Figure 7.5. 


[Figure 7.5: path diagram made of the nodes labelled ARB, ARB2 
and nil, with external alternate 'a' and subsequent 's'.] 

Figure 7.5 


An improved version of ARB. 
The associated primitives ARB1 and ARB2 are defined as: 


*  Save the state of the futility flag and set it in order to 
*  detect it in the subsequent. 
ARB1     PUSH(FUTILITY) 
         FUTILITY = 1                                    :(S) 

*  If the subsequent is futile, restore the old futility flag 
*  and length fail provided we're in QUICKSCAN mode. 
ARB2     FUTILITY = EQ(FUTILITY,1) EQ(&FULLSCAN,0) 
+           POP()                                        :S(LF) 

*  Else bump the cursor and compare with LENGTH of subject. 
*  If beyond the end of the subject, pop the old futility 
*  flag and match-fail. 
         CURSOR = CURSOR + 1 
         (GT(CURSOR,LENGTH) POP())                       :S(MF) 

*  Otherwise, play scanner by pushing ourself and the current 
*  cursor onto the stack and succeed. 
         PUSH(NODE) ; PUSH(CURSOR)                       :(S) 


Note the action of ARB if its subsequent is futile. ARB itself
is regarded as being futile and it indicates this condition by
restoring the state of the futility flag. Note that this
algorithm is obtrusive if the subsequent is varying. For
example, the pattern matching statement


          'ABCB' LEN(1) $ X ARB 'C' *X


will succeed in FULLSCAN mode with X matching 'B' but will
fail in QUICKSCAN mode. In QUICKSCAN mode the 'A' is assigned
to X initially; when 'C' match-fails, control arrives at ARB2
which increments the cursor. Ultimately, 'C' length-fails.
When control arrives at ARB2, the FUTILITY flag is still on
resulting in a length failure and termination of the match.
It is important that ARB length-fail if its subsequent is
futile. Consider the pattern match


          S ARB . T 'CAT'


which scans S for 'CAT' assigning the prefix to T. If no 'CAT'
exists in S, the match will require on the order of L^2 matches
under FULLSCAN and on the order of L matches under QUICKSCAN
where L is the length of the string. Here the desire to have
unobtrusive heuristics seems to collide with the need for an
intelligent scanner. No completely satisfactory scheme has
yet been worked out.
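

The difference between the two scanning modes can be made
concrete with a small cost model. The sketch below is written in
Python rather than SNOBOL4 so that it can be run directly. It is
not the scanner itself: the function name and the way attempts
are counted are assumptions made for illustration. It only
counts how often the literal is tried for an unanchored match of
the form S ARB . T 'CAT'.

```python
def count_literal_attempts(subject, literal, fullscan):
    """Count how often the literal is tried in  S ARB . T literal."""
    length = len(subject)
    attempts = 0
    for start in range(length + 1):              # scanner moves the start
        for cursor in range(start, length + 1):  # ARB grows by one character
            attempts += 1
            if subject.startswith(literal, cursor):
                return attempts, subject[start:cursor]
            too_short = length - cursor < len(literal)
            if too_short and not fullscan:
                # Length failure: ARB reports futility, the match ends.
                return attempts, None
    return attempts, None


subject = "X" * 10
print(count_literal_attempts(subject, "CAT", fullscan=True))   # (66, None)
print(count_literal_attempts(subject, "CAT", fullscan=False))  # (9, None)
```

For a subject of length 10 with no 'CAT', the model tries the
literal 66 times without the heuristic and 9 times with it,
which is the quadratic versus linear growth described above.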
BAL


Define a balanced string as any string which either 1) does
not contain a parenthesis, or 2) is a balanced string bounded
by parentheses or 3) consists of any sequence of balanced
strings. The BAL pattern of SNOBOL4 matches all nonnull
balanced strings beginning at a given pre-cursor position. The
sequence of post-cursor positions is from smaller to larger.
It is relatively straightforward to write a monic pattern to
match the earliest (i.e. shortest) balanced string starting at
a pre-cursor position. A parenthesis count is maintained. If
a left paren is encountered the count is incremented by 1. If
a right paren is encountered the count is diminished by 1. If
the count ever goes negative the monic fails. If the count
reaches 0 (after the first character), a successful match is
reported. This monic pattern is available as a primitive
(called GBAL) within the implementation and is used to
implement BAL. As an example the table below shows the behavior
of GBAL on the subject 'A(C()D)'.


          Pre-cursor     0   1   2   3   4   5   6   7
          Post-cursor    1   7   3   5   -   6   -   -


where a dash (-) indicates failure. BAL can be written in 
terms of GBAL as 


BAL = GBAL ARBNO(GBAL) 


and the corresponding BAL compound is shown in Figure 7.6.


Figure 7.6


The BAL compound. 
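

The parenthesis-counting rule given for GBAL, and the
construction of BAL from it, can be sketched as follows. This is
an illustration in Python, not the implementation's primitive;
the function names gbal and bal are chosen here only to mirror
the text.

```python
def gbal(subject, pre):
    """Post-cursor of the shortest balanced string at pre, or None."""
    count = 0
    for i in range(pre, len(subject)):
        ch = subject[i]
        if ch == "(":
            count += 1
        elif ch == ")":
            count -= 1
            if count < 0:          # more right parens than left: fail
                return None
        if count == 0:             # balanced after at least one character
            return i + 1
    return None                    # ran off the end of the subject


def bal(subject, pre):
    """All post-cursors of BAL = GBAL ARBNO(GBAL), smallest first."""
    post = gbal(subject, pre)
    while post is not None:
        yield post
        post = gbal(subject, post)


print([gbal("A(C()D)", pre) for pre in range(8)])
print(list(bal("A(C()D)", 2)))
```

The first line printed reproduces the table for 'A(C()D)', with
None standing for the dash. The second shows BAL started at
pre-cursor 2 yielding the post-cursors 3, 5 and 6 in turn.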


The GBAL primitive, as the above example illustrated, is not
monotonic and hence does not transmit length failure. GBAL,
therefore, turns the futility flag off if it succeeds. If the
subsequent s is futile, further alternatives need not be
taken.


ARBNO(p)


The path diagram for ARBNO(p) is obtained from the path 
diagram for p in the by-now familiar method suggested by the 
examples of ARB and BAL. Figure 7.7 indicates how we can form 
this path diagram from the path diagram for the pattern p.


[Path diagram not reproduced: two nil nodes leading to the
subsequent s.]

Figure 7.7


A path diagram for ARBNO(p).


An expression of the form p . v where p is a pattern and v is
a variable (or an unevaluated expression which will evaluate
to a variable) is called a conditional variable association.
The variable v is associated with the indicated pattern and
will be assigned the substring matched by p on the condition
that the overall pattern is successful. An expression p $ v
is called an immediate association. Any substring matched by
p is assigned immediately to v. The path diagram for p . v
can be given in terms of the path diagram for p and is shown
in Figure 7.8. A similar diagram could be drawn for p $ v.


The stack which receives alternates and cursor values during 
the course of the match is called the pattern matching history 
stack or PM stack for short. To describe the operation of 
conditional variable association, we postulate the existence 
of two more stacks which we will refer to as stack Alpha and 
stack Beta. When VA1 (Variable Association 1) receives con- 
trol, it pushes the current cursor (pre-cursor position) onto 
stack Alpha. If p should fail, VAB1 (Variable Association on 
Backup 1) will receive control and it will pop Alpha. It will 
then fail forcing control to go to alternate a. Should p suc- 
ceed, control arrives at VA2. The current cursor and the pre- 
cursor pushed by VA1 are sufficient to define the string to be 
assigned to variable v. The two cursor positions and v are
pushed onto stack Beta and the cursor on stack Alpha is
popped. Should the subsequent fail, VAB2 gets control and
undoes what VA2 did. That is, the three values on Beta are
popped and Alpha is pushed with the original pre-cursor
position. VAB2 then fails forcing alternates on the PM stack to
be invoked.


[Path diagram not reproduced: the diagram for p enclosed by the
nodes VA1 and VA2, with backup nodes VAB1 and VAB2, alternate a
and subsequent s.]

Figure 7.8


A compound for p . v


If the overall match is successful, Beta is scanned on a FIFO 
basis (left-to-right) and assignments are made in turn. If 
the variable is an unevaluated expression, the evaluation is 
made at this time, by a possibly recursive call. 
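

The bookkeeping performed by VA1, VAB1, VA2 and VAB2 can be
sketched with two ordinary lists standing in for stacks Alpha
and Beta. This is a Python illustration of the description
above, not the implementation; the class and method names are
assumptions, and unevaluated-expression variables are left out.

```python
class ConditionalAssociation:
    """Bookkeeping for  p . v  using the Alpha and Beta stacks."""

    def __init__(self):
        self.alpha = []   # pre-cursors of associations still inside p
        self.beta = []    # (pre-cursor, post-cursor, variable) triples

    def va1(self, cursor):
        self.alpha.append(cursor)          # remember where p started

    def vab1(self):
        self.alpha.pop()                   # p failed: forget the start

    def va2(self, cursor, variable):
        pre = self.alpha.pop()             # p succeeded
        self.beta.append((pre, cursor, variable))

    def vab2(self):
        pre, _post, _variable = self.beta.pop()
        self.alpha.append(pre)             # undo VA2, back into p

    def assign(self, subject, variables):
        for pre, post, variable in self.beta:   # first in, first out
            variables[variable] = subject[pre:post]
        return variables


stacks = ConditionalAssociation()
stacks.va1(0)
stacks.va2(2, "T")      # p matched 'AB'
stacks.vab2()           # the subsequent failed, control re-enters p
stacks.va2(3, "T")      # p now matches 'ABC'
result = stacks.assign("ABCD", {})
print(result)           # {'T': 'ABC'}
```

Only the triple left on Beta when the match finally succeeds is
assigned, so the earlier, abandoned match of 'AB' leaves no
trace.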


Stack Beta is normally called the name-list stack. It operates 
in synchronism with the PM stack and, hence, it would have 
been possible to use this latter stack to push the two cursor 
values and the variable. It would not normally be difficult 
or time-consuming to extract these values from the PM stack at 
termination of matching. But differences in the way the gar- 
bage collector treats each stack may make a separate name-list 
stack desirable. Here, implementation considerations at the 
bit level often determine whether 1 or 2 stacks are used for 
this purpose. Stack Alpha, on the other hand, grows
differently than the PM stack. The overall system stack which is
employed for expression evaluation and recursive calls is
used. The system stack, as we will see, may be active during
pattern matching (to implement unevaluated expressions) but
its net growth from the beginning of processing of one node to
the beginning of processing of its subsequent is always 0
(unless used as the Alpha stack of substring assignment).


Immediate variable association is similar to but simpler than
conditional association and will be left as an exercise.


UNEVALUATED EXPRESSIONS


Unevaluated expressions may be used as patterns and, if so,
are evaluated during a pattern match. The result of such an
evaluation may be any pattern, even one containing unevaluated
expressions. The difficulty with unevaluated expressions, which
can result in arbitrary path diagrams, is in how to effectively
combine the new path diagram with the old. In principle, this
path diagram could be fused into the overall pattern by means
of the pattern building process discussed earlier. However,
since this pattern is evaluated whenever the scanner is moving
forward through the pattern, this pattern building process may
take place many times during a single pattern match. Worse,
the pattern would have to be detached before the next new
patterns were joined and this would promise more difficulties.
Hence, rebuilding the pattern is not a satisfactory solution.


Let STAR be the program label associated with that part of the
system which is to process unevaluated expressions. The
argument in the node associated with STAR is the unevaluated
expression which we assume that STAR can readily evaluate. We
note that the evaluation of the argument can invoke a
programmer-defined function which can, by virtue of its
performing pattern matching, re-enter the scanner. This requires
that, before the unevaluated expression is evaluated, a host
of values such as the cursor position, the subject, the
current value of the push-down list, and the NODE position be
placed in the system stack to be restored after the argument
is evaluated. In our pseudo-implementation of pattern matching
all this is taken care of automatically by declaring the
appropriate variables to be either parameters or temporaries of
the function MATCH.


Assuming that this is done, the result of this evaluation is a
pattern P. What STAR must do is apply this pattern to the
subject at the given pre-cursor position. This can be done by
a call (recursive) to the function SCAN if we first provide
isolation between this call and previous uses of the stack.
This takes the form


STAR      P = EVAL(ARG(NODE))
          PUSH(NULL) ; PUSH(CURSOR)
          SCAN(LENGTH,P)                                 :F(MF) S(S)


It is a minor detail but if the result of evaluation is an
unevaluated expression it is again EVALed. Assuming that a
pattern P emerges from the evaluation procedure it is applied
to the subject at the current cursor position by means of the
call to SCAN. If P fails, the insulating null-cursor will have
been popped and SCAN will fail. In this case STAR simply
relays the failure. If P succeeds, SCAN will succeed and STAR
reports success. If the subsequent to STAR is ultimately
successful, nothing more need be said. If unsuccessful, the list
of alternates laid down on the stack by P must be invoked. But
they cannot be invoked straight away as any gyrations of their
own accord would cause success or failure of the evaluated
pattern P to be interpreted as success or failure of the
pattern as a whole. Hence a kind of second insulation is set up
to receive control should s fail. This comes in the form of
the primitive RESTAR shown in Figure 7.9.


[Path diagram not reproduced: a STAR node followed by a nil
node leading to the subsequent s, with a RESTAR node as backup
and alternate a.]

STAR      P = ARG(NODE)
STAR_1    P = EVAL(P)                                    :F(MF)
          IDENT(DATATYPE(P),'EXPRESSION')                :S(STAR_1)
          PUSH(NULL) ; PUSH(CURSOR)
STAR_2    REDUCTION = 0
          REDUCTION = EQ(&FULLSCAN,0) RESID(NODE)
          GT(REDUCTION,LENGTH)                           :S(LF)
          SCAN(LENGTH - REDUCTION,P)                     :F(MF) S(S)
RESTAR    CURSOR = POP()
          P = POP()
          IDENT(P,NULL)                                  :S(MF) F(STAR_2)


Figure 7.9 


A compound to implement Unevaluated Expressions. 


When RESTAR receives control it pops the stack. If the alter- 
nate is null, this is the insulating null-cursor pair and 
RESTAR simply fails. Otherwise it merges with the STAR primi- 
tive which calls SCAN with the popped alternate as argument. 


The previously cited Recursive Reduction heuristic is shown in
Figure 7.9. A fifth field of a pattern node is called the
residual. This equals the minimum number of characters
required by the node's subsequent to match. The field name used
is RESID so that the data statement for a pattern node should
really read


          DATA('NODE(PROG,SUBS,ALT,ARG,RESID)')


Residuals are computed by assigning a minimum length string to
each pattern. For example, the minimum lengths of BREAK(S), 
TAB(N), POS(N) and FENCE are each 0. The minimum length of 
SPAN(S) and BAL are each 1. The minimum length of a string is 
the size of the string, etc. The minimum length of the 
concatenation of two patterns is the sum of their minimum 
lengths. The minimum length of the alternation of two patterns 
is the minimum of their minimum lengths. When two patterns 
are concatenated, the residual of each node is incremented by 
the minimum length of the second pattern. When two patterns 
are alternated, all residuals remain unchanged. The minimum 
length of a pattern can either be partially recomputed for 
each concatenation from the residual of the root node and the 
minimum length of the root or may be stored in a pattern 
header where global information about the pattern is kept or 
may be retained separately for each node in another field 
(MINLEN) of the pattern node. 
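

The minimum-length rules above amount to a short recursive
computation. The following Python sketch applies them to a toy
pattern representation; the tuple encoding and the function
name min_length are assumptions made for this illustration and
are not the node layout used by the implementation.

```python
def min_length(pattern):
    """Minimum number of characters a pattern can match.

    A pattern is a string literal, a ('prim', name) tuple, or a
    ('cat', left, right) / ('alt', left, right) tuple.
    """
    primitive_minimum = {"BREAK": 0, "TAB": 0, "POS": 0, "FENCE": 0,
                         "SPAN": 1, "BAL": 1}
    if isinstance(pattern, str):
        return len(pattern)                      # size of the string
    kind = pattern[0]
    if kind == "prim":
        return primitive_minimum[pattern[1]]
    if kind == "cat":                            # sum for concatenation
        return min_length(pattern[1]) + min_length(pattern[2])
    if kind == "alt":                            # minimum for alternation
        return min(min_length(pattern[1]), min_length(pattern[2]))
    raise ValueError(kind)


span_cat = ("cat", ("prim", "SPAN"), "CAT")
two_breaks = ("alt", ("cat", ("prim", "BREAK"), ("prim", "BREAK")), "B")
print(min_length(span_cat))     # 4
print(min_length(two_breaks))   # 0
```

When two patterns are concatenated, the value min_length
returns for the second pattern is the amount by which the
residual of each node of the first is incremented.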


As an example of the recursive reduction heuristic


          P = *P 'A' | 'B'


will not loop. Since the residual of *P is 1 (the minimum
length of 'A'), SCAN is called with ever decreasing LENGTHs.
On the other hand


          P = *P BREAK('A') BREAK('B') | 'B'


will loop because the residual of *P is 0. Note that
BREAK('A') BREAK('B') matches at least one character but the
simple-minded minimum-character algorithm fails to detect
this.


It is not uncommon to encounter the BNF-like expression


          P = *P *Q | 'A'


This pattern would loop if it were not for the drastic assump- 
tion that unevaluated expressions require a single character 
to match. This is the so-called one-character assumption. 
Given this assumption, the residual of *P is 1 and so the num- 
ber of recursive plunges is limited by the length of the 
string. Note that the one-character assumption has nothing to 
do with the number of characters required by *P but only *Q. 
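

The effect of the reduction can be sketched for a left-recursive
pattern of the form P = *P 'A' | 'B', where the residual of *P
is 1. The Python generator below is a hand-written stand-in for
the recursive call to SCAN, under the assumption that each
plunge is given a length one smaller than its caller's; it is
not the scanner code.

```python
def match_p(subject, cursor, length):
    """Post-cursors for  P = *P 'A' | 'B'  within the first `length` chars."""
    residual = 1                       # minimum length of the 'A' after *P
    # First alternative: *P 'A'. Plunge only if the reduced length
    # is not negative; otherwise this alternative length-fails.
    if residual <= length:
        for mid in match_p(subject, cursor, length - residual):
            if mid < length and subject[mid] == "A":
                yield mid + 1
    # Second alternative: 'B'.
    if cursor < length and subject[cursor] == "B":
        yield cursor + 1


print(list(match_p("BAA", 0, 3)))   # [3, 2, 1]
print(list(match_p("AAA", 0, 3)))   # []
```

Without the length test the first alternative would call itself
forever. With it, the depth of recursion is bounded by the
length of the subject, as the text states.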


EXERCISES


Exercise 7.1  Implement the BREAK(S) primitive (call it
BREAKP) in SNOBOL4 source in a manner similar to the way in
which the LEN(N) primitive (called 'LENP') was implemented in
the text. Assume that ANY(S) and POS(N) are available.


Exercise 7.2  There is a single pattern primitive called
CHARP which is used in matching any string against the
subject. The string is contained in ARG(NODE) while PROG(NODE)
contains CHARP. Assuming SUBSTR (Prog. 3.9) is available show
how CHARP could be implemented in SNOBOL4 source. Pass control
to LF or MF on failure depending on whether or not the pattern
is futile.


Exercise 7.3  After executing the instructions below, (a)
how many s-vacancies will there be in P? (b) how many
a-vacancies? Express your answer in terms of N.


          P = 'A'
          I = 0
LOOP      P = (P | P) (P | P)
          I = I + 1 LT(I,N)                              :S(LOOP)


Exercise 7.4  As indicated in the text, to properly
concatenate two patterns, the first must be copied. Assuming
the patterns are linked structures as indicated in the function
CONCAT, implement CONCAT as a modified form of COPYL
(Prog. 5.8).


Exercise 7.5  A path diagram is well-formed if (1) any
sequence of alternates ends in an a-vacancy (i.e. no loop of
alternates exists) and (2) no loop of subsequents exists. Show
that any path diagram formed by alternating, concatenating or
ARBNO'ing (see Figure 7.7) well-formed (and distinct) path
diagrams produces only well-formed path diagrams.


Exercise 7.6  One implementation of patterns encodes them
as a contiguous set of nodes together with a header to form
one large array as shown in Figure 7.10.


The root node is always node 1. The MIN field is the minimum
length string that the pattern will match. FLAG and START are
used as the anchoring field. If FLAG is 1 and START contains
N, then the pattern is anchored in the form POS(N) ... If FLAG
is -1 then the pattern is anchored in the form RPOS(N) ... If
FLAG is 0, no special anchoring heuristic exists.


The alternate and subsequent fields contain the subscript of
the target nodes. If empty, these fields contain some
nonpositive integers.


          <1>  MIN      \
          <2>  FLAG      >  Header
          <3>  START    /
          <4>  PROG     \
          <5>  ALT       |
          <6>  SUBS      >  Node 1
          <7>  ARG       |
          <8>  RESID    /
           .
           .               all other nodes
           .

Figure 7.10


The data structure for a practical implementation
of patterns.


Write a subroutine to build (a) the alternation and (b) the
concatenation of two patterns and (c) find the ARBNO of one
pattern.


Exercise 7.7  How many primitive matches (successful and
unsuccessful) are involved in the following pattern matching
statements?


(a) ‘'ABCDEFGHIJKLMN' ‘EFS. { *c! 


(b) DUPL('A', 20) 'B' an 
(c) DUPL('A',20) aN ‘Bt 
(d) ‘AAEAAACE! (fc! fF 'Dty (TE! | ft FL 
(e) DUPL('A',20) SPAN('A') | BREAK('A')


(f) 'AABAAC' SPAN('A') 'C'


Exercise 7.8  Write the MATCH function which serves to
drive the scanner. Be sure to set and test the futility flag
(FUTILITY) if &FULLSCAN is off and check &ANCHOR. MATCH will
have two arguments, the subject S and the pattern P. Have
MATCH fail if the pattern fails and return the string matched
if it succeeds. Be sure to indicate which variables are
temporary.


Exercise 7.9  Which of the following monic patterns are
transmitters of futility?

(a) SPAN('AB') | NOTANY('AB')

(b) TAB(N) | POS(N + 1)

(c) 'ABA' | 'B'

(d) 'ABCD' | 'DCBA'

Exercise 7.10  Which of the following patterns are
contextually anchored and what is the character set in each
case?


          (ANY('AB') | SPAN('DE') | 'CAT') LEN(3)

          POS(3) BREAK('AB')

          ('A' | (SPAN('B') | 'CAN'))

          ARBNO(ANY('AB'))


Exercise 7.11  If the subsequent P to the pattern TAB(N)
fails (even if the failure is match-failure) one may presume
that TAB(N) P is futile and no increase in cursor position can
help. How would we implement TAB(N) to take advantage of this?


Exercise 7.12  If a user requires that BAL match the null
string he may very easily create a pattern which will provide
this extension. He may write:


          NEW_BAL = NULL | BAL


(a) Draw the resulting path diagram.


(b) Design a compound for implementing NULL | BAL directly
(using GBAL of course).


Exercise 7.13  In QUICKSCAN mode, if the subsequent to
ARBNO(P) is futile, no further extensions need be taken
provided P cannot match a string of negative length. The
compound shown in Figure 7.11 below is designed to implement
this heuristic. Describe the operation of the primitives ARBN1
and ARBN2 in SNOBOL4 source, i.e. in a manner similar to the
descriptions of ARB1 and ARB2.


[Path diagram not reproduced: a nil node followed by ARBN1
leading to the subsequent s, with the diagram for p and the
node ARBN2 forming the loop, and alternate a.]

Figure 7.11


A path diagram to implement a futility heuristic
for ARBNO.


Exercise 7.14  Design a compound for implementing BREAKX()
(the SPITBOL function, see Prog. 8.2) assuming that the BREAK
primitive and LEN(1) are available.


Exercise 7.15  Describe how you would implement the pattern
NOT(P) defined as matching the null string if P fails, failing
if P succeeds, and aborting if P aborts.


Exercise 7.16  In chapter 6, ARBNO(P) was defined as


          ARBNO(P) = NULL | P ARBNO(P)


Show that the derived pattern of the path diagram in Figure
7.7 is


          ARBNO(P) D(s) | D(a)


where P is the derived pattern of the path diagram p. You may
assume in your proof that P does not match the null string.


Exercise 7.17  The scanner function operates in such a
manner that the pattern implemented is the derived pattern:


          p D(s) | D(a)


Rewrite SCAN so that the derived pattern is:


          D(a) | p D(s)


Exercise 7.18  Rewrite SCAN to implement the derived
pattern


          (p | D(a)) D(s)


(Hint: study STAR and RESTAR carefully and do not
underestimate this problem.)


Exercise 7.19  To eliminate one of the nil nodes of Figure
7.4, it is proposed that the alternate be 'hung off' the
LEN(1) node, eliminating the first nil entirely. Show that the
derived path diagram of this combination does not equal


          ARB D(s) | D(a)


as it should.


Exercise 7.20  Assume that a flag exists called UEFLAG
which is set by STAR to indicate that an unevaluated
expression was encountered. Modify ARB so that the length fail
heuristic is unobtrusive but so that ARB reports length fail
if there are no unevaluated expressions encountered in the
subsequent to ARB.


Patterns are data objects and, as such, enjoy the same
rights and privileges bestowed on objects having the more
conventional typings of STRING, INTEGER and REAL. In
particular, patterns may be assigned to variables (possibly
array elements or field variables) and may be passed to and
from functions. This chapter intends to demonstrate these
capabilities and describes a number of useful (and
not-so-useful) pattern-valued functions and also provides a few
very practical patterns for analyzing common linguistic cases.

A word perhaps should be said about the virtue of attempting
to solve as much of the problem as possible with one big
pattern match. This can obviously be overdone. For example:


          S (REM $ OUTPUT FAIL | LEN(1) . T REM . S)


serves to both print the string S and separate it from its
first character. This has the same effect as:


          OUTPUT = S
          S LEN(1) . T REM . S


The two-line version is clearer and, if anything, more ef- 
ficient and is easier to type and modify. The one-line version 
might perhaps be written to be cute or perhaps in the mistaken 
belief that statement overhead is significant (it is not). 


There are, however, often excellent reasons for using one pat- 
tern match as opposed to two or more. Consider looking for a 
quoted literal while analyzing SNOBOL4 source. Assume S con- 
tains a valid SNOBOL4 statement and assume we wish to search 
for the existence of a quoted literal assigning it to the 
variable X and transferring to NONE if none exists. One poor 
attempt is: 


          Q = "'"
          QQ = '"'
          S (Q BREAK(Q) Q) . X                           :S(AROUND)
          S (QQ BREAK(QQ) QQ) . X                        :F(NONE)
AROUND


If the two pattern matches are replaced by:


          S (Q BREAK(Q) Q | QQ BREAK(QQ) QQ) . X         :F(NONE)


the result is not necessarily clearer or more efficient but does
have the beneficial property of not being wrong. If the string 
S contained 


then the two-pattern case would have erred. 
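

The way the two-pattern version goes wrong can be reproduced
with a small experiment. The sketch below uses Python regular
expressions as stand-ins for the SNOBOL4 patterns, with
'[^']*' playing the role of Q BREAK(Q) Q. The sample statement
is a hypothetical one chosen for this illustration and is not
taken from the text: a double-quoted literal that contains an
apostrophe, followed by a single-quoted literal.

```python
import re

SINGLE = r"'[^']*'"      # stands in for  Q BREAK(Q) Q
DOUBLE = r'"[^"]*"'      # stands in for  QQ BREAK(QQ) QQ


def two_matches(statement):
    """Look for a single-quoted literal first, then a double-quoted one."""
    found = re.search(SINGLE, statement) or re.search(DOUBLE, statement)
    return found.group(0) if found else None


def one_match(statement):
    """Try both kinds of literal at each position, left to right."""
    found = re.search(SINGLE + "|" + DOUBLE, statement)
    return found.group(0) if found else None


statement = """X = "IT'S" 'A'"""
print(two_matches(statement))   # 'S" '   (not a literal of the statement)
print(one_match(statement))     # "IT'S"
```

The first search pairs the apostrophe inside the double-quoted
literal with the quote that opens 'A'. The single alternation
reaches the double quote first and so picks up the correct
literal.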




There are times when a single large pattern can take the place 
of many lines of code. I have seen a case where a programmer 
wrote a machine-language subroutine (to be called from 
SNOBOL4) to parse the 360 assembler language where this parse 
can be written as one not-too-complex pattern (ASM360, Program 
8.11). The reason I saw it at all was because the program 
became a hopeless jumble and the writer of the program was 
virtually lost in a sea of complexity. The mistake made here 
was to assume that because, in assembly language, each step is 
quite clear, that the composition of an arbitrary number of 
such steps should also be clear. Programming offers no more 
vivid testimony than to deny this assumption. 


Program 8.1  BRKREM


There are cases when it is desirable that the pattern BREAK(S)
match the entire string if (and only if) there are no break
characters found. If it were not for the 'only if' proviso,
the pattern


          BREAK(S) | REM


would do. But this pattern has the potentiality of matching 2
strings; i.e. it is not monic.


*  BRKREM(S) returns a pattern that will behave like BREAK(S)
*  if that pattern would succeed and will match the remainder
*  of the subject string otherwise.
          DEFINE('BRKREM(S)CS')                          :(BRKREM_END)

*  If S is null there are no break characters. Return a
*  pattern which will consume the rest of the string.
BRKREM    BRKREM = IDENT(S) REM                          :S(RETURN)

*  Find the set complement (CS) of S. If this is null, BRKREM
*  should match the null string.
          CS = DIFF(&ALPHABET,S)
          IDENT(CS)                                      :S(RETURN)

*  Otherwise return the alternation of 3 mutually exclusive
*  cases.
          BRKREM = RPOS(0) | SPAN(CS) RPOS(0) | BREAK(S) :(RETURN)
BRKREM_END
Names referenced by BRKREM:

          Name      Type        Where defined
          DIFF      Function    Program 3.10


Program 8.2  BREAKX


The pattern BREAK(S) where S is a string will rapidly scan for
one of the characters in S, stopping just short of the found
character. The scanning is done as fast as the hardware will
allow and, for 360 implementations this is quite rapid. But
suppose the problem is not to scan for a character but for a
string S. This can be done quite easily by the statement


          SUBJECT S


To speed up the search, we might think of using BREAK to scan
for the initial character of S as follows


          S LEN(1) . INITIAL
          SUBJECT POS(0) BREAK(INITIAL) S


This will succeed if S appears at the first instance of its
initial character. Otherwise the pattern would fail since 
BREAK cannot match a string containing INITIAL. If we were to 
remove the POS(0) the pattern would ‘work' in the sense that 
it would succeed when required but the time required to do so 
could be worse than before. This is because the scanner would 
increment the cursor by 1 after each failure and thereby move 
quite slowly toward its destination. To fix the situation we 
define a function called BREAKX (BREAK eXtended) which, upon 
failing, will extend past the break character to find another. 
Like BAL and ARBNO, BREAKX is said to have implicit 
alternatives. 


BREAKX was first introduced as a built-in function in SPITBOL
and appears in SITBOL and FASBOL.


          DEFINE('BREAKX(S)')                            :(BREAKX_END)

BREAKX    BREAKX = BREAK(S) ARBNO(LEN(1) BREAK(S))       :(RETURN)
BREAKX_END
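

The implicit alternatives of BREAKX can be pictured as a
sequence of cursor positions, one for each break character in
turn. The Python sketch below models that sequence and uses it
for the string search discussed above; the function names and
the sample subject are assumptions made for illustration.

```python
def breakx(subject, cursor, chars):
    """Yield each cursor position at which a break character is found."""
    for i in range(cursor, len(subject)):
        if subject[i] in chars:
            yield i                  # stop just short of the character


def find(subject, s):
    """Model of  SUBJECT POS(0) BREAKX(INITIAL) S  for a nonnull S."""
    initial = s[0]
    for i in breakx(subject, 0, initial):
        if subject.startswith(s, i):
            return i                 # S matched here
    return None                      # alternatives exhausted


print(list(breakx("MISSISSIPPI", 0, "S")))   # [2, 3, 5, 6]
print(find("MISSISSIPPI", "SIP"))            # 6
```

Each failure of S simply resumes the scan for the next
occurrence of its initial character, instead of restarting the
whole match one cursor position further along.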


Program 8.3  BAL


In analyzing programs BAL can be quite useful but it is also
limited in that it cannot be applied freely to expressions
which permit quote marks. For example, even though the string

"ABC (DEF '(' GHI) JKL" 


is balanced in the syntax of SNOBOL4, BAL would not match it. 
Since most languages have the capability of permitting quoted 
expressions within an expression, this severely hinders the 
application of BAL. 


Analyzing languages which have bracketing other than, or in
addition to, parenthesization also presents a situation in
which BAL is inadequate. For example, suppose that a list of
arguments (expressions separated by commas) is contained in
the string LIST and suppose that its initial left parenthesis
were removed. For example


          LIST = '13, A + B(3,4), C)'


In order to pick off arguments from such a list, we may think 
of using the pattern matching statement: 


LIST POS(0) BAL . ARG ANY(',)') = 


Aside from the problem of quoted literals this statement will
work correctly only if the source language contains no other
kind of bracketing. For example, if the source language were
SNOBOL4 and if LIST contained:


          LIST = '13, A + B<3,4>, C)'


the pattern matching statement described above would find
' A + B<3' as second argument which of course is incorrect.


The function BAL(PARENS,QTS) will return a pattern which will
match all nonnull balanced strings where the first argument is
used to specify paired brackets in nested fashion and the
second argument specifies characters used as quotes. For
example BAL('(<>)', "'" '"') will match a balanced string in
SNOBOL4 source. Also BAL('()') is equivalent to the built-in
pattern BAL.


Let us consider how we might define the built-in pattern BAL
if it did not exist before proceeding to the more general
case. BAL is a pattern which will match any string balanced
with respect to parentheses. A balanced string is defined as


1. Any single character not a parenthesis is balanced.
2. If B is balanced or is null then '(' B ')' is balanced.
3. If B1 and B2 are balanced, then B1 B2 is balanced.


A straightforward translation of this definition could be used
to define BAL and it would have the appearance:


          BAL = NOTANY('()') | '(' (*BAL | NULL) ')' | *BAL *BAL


The difficulty with this rendition of BAL is twofold. It uses 
the stack heavily (even when there are no parentheses in the 
subject) and it is inefficient especially if it is headed for 
failure. The difficulty in both cases is the third alterna- 
tive. As discussed in the previous chapter, there are two 
kinds of stack usage that we must be concerned with. There is 
the relatively mild requirements of the alternatives which 
must be placed on the history stack; then there are the more 
severe requirements of recursion. This version of BAL uses 
the recursion stack quite heavily. Consider the match


          '(XXX ... X)'  '(' BAL ')'


where there are N X's in the subject string. The maximum
recursive level is N-1. What's worse, if the pattern BAL does
not succeed as in


          '(XXX ... X'  '(' BAL ')'


the time required rises exponentially with the length of the 
subject. 


Another approach to encoding BAL is as follows: let GBAL match
only the first balanced string (as opposed to all balanced
strings). Then express BAL in terms of GBAL.


          GBAL = NOTANY('()') | '(' (*BAL | NULL) ')'
          BAL = GBAL ARBNO(GBAL)


This reduces BAL to sequential application of GBAL's and the
time to determine failure does not rise exponentially. There
is still the problem that the amount of stack used rises
linearly with the length of the subject, though this time
the stack used is the history stack and not the recursive
stack. An alternate-cursor pair is laid down at each
nonparenthesis scanned in the subject string. As this may be
disturbing for large strings a better tactic is to reverse the
order of alternation in defining GBAL as follows:


          GBAL = '(' (*BAL | NULL) ')' | NOTANY('()')


There is a time-storage tradeoff here. While this version of 
GBAL consumes less stack, it requires slightly more time in 
the event that the pattern is to succeed. We will opt for 
reduced stack usage. 


Another problem associated with writing the BAL function is
how to return a recursively defined pattern from a function.
Consider the function F(P) which attempts to return a
pattern to match a sequence of P's.


          DEFINE('F(P)')                                 :(F_END)
F         F = P *F | NULL                                :(RETURN)
F_END


F returns a pattern whose definition depends on the current 
value of F. But Lord knows what the value of F is after the 
return. It can be anything, since the old value of F is 
restored. Moreover, even if a global name were used, the name 
would be reassigned a new value each call. A way to avoid 
these problems is to create a unique name at each call. Assume
for the sake of argument that F1876 is such a unique name.
Then if


          F1876 = P *F1876 | NULL
          F = F1876                                      :(RETURN)


were executed, the desired value would be returned. Code such
as this could be created dynamically via the CODE function. A 
more efficient technique is to convert the unique name to 
EXPRESSION. This is done in defining BAL. 


      DEFINE('BAL(PARENS,QTS)Q,GBAL,NAME,STAR,LP,RP')
+                                             :(BAL_END)
*  Entry point: Create a unique but uncommon name (NAME) for
*  a variable which is to be assigned the pattern. To use it
*  recursively, we will need the associated unevaluated
*  expression (STAR). Also initialize GBAL.
BAL   NAME  =  'BAL_.' &STCOUNT
      STAR  =  CONVERT(NAME,'EXPRESSION')
      GBAL  =  NOTANY(PARENS QTS)
*  Loop on quote characters inserting a quoted literal as an
*  optional candidate for a balanced string.
BAL_1 QTS  LEN(1) . Q  =                      :F(BAL_2)
      GBAL  =  Q BREAK(Q) Q | GBAL            :(BAL_1)
*  Loop on the nested bracketing characters and create a
*  balanced alternate for each pair.
BAL_2 PARENS  LEN(1) . LP  RTAB(1) . PARENS  LEN(1) . RP
+                                             :F(BAL_3)
      GBAL  =  LP (STAR | NULL) RP | GBAL     :(BAL_2)
*  Define BAL (the returned string) in terms of GBAL and
*  assign it to the strangely named variable so that recursion
*  works.
BAL_3 BAL  =  GBAL ARBNO(GBAL)
      $NAME  =  BAL                           :(RETURN)
BAL_END

Epilogue


Note that the name of the function is the same as the name of 
a built-in pattern BAL. Both the variable and the function 
can co-exist and can be entirely unrelated. Note that when 
the function is called the variable BAL is temporarily as- 
signed a null value and is subsequently assigned the return 
value. Upon return, the original value of BAL is restored so 
no difficulty ensues. 


Program 8.4  FASTBAL

A criticism that could be leveled against the BAL function is
that the pattern it returns creeps along, one character at a
time, at speeds determined by ARBNO(NOTANY()). A much faster
version can be written which will skip over uninteresting
characters at BREAK speeds and stop only before parens,
quoted-literals and any of a set of designated characters
provided as a third argument. For example

      SNOARG  =  FASTBAL('(<>)', '"' "'", ',)') . ARG  ANY(',)')


will assign to SNOARG a pattern which can be used to scan for 
the arguments of a function call in SNOBOL4 source. If the 
string to be scanned is 


      A 'B)' + F(")"), X)


then SNOARG will tentatively match "A " and then "A 'B)' + F"
before finally matching "A 'B)' + F(")")". FASTBAL, like
BREAKX, will continue to take extensions. For example, the 
pattern match 


      'A/B(/D)/D'  POS(0) FASTBAL('()',,'/') '/D'

will succeed with the entire subject being matched. 


Like BREAKX and unlike BAL, FASTBAL will not match the entire 
string since it requires a break character. Such a modifica- 
tion, however, is easily made and is explored in an exercise. 


      DEFINE('FASTBAL(PARENS,QTS,S)NAME,IBAL,SPCHARS,ELEM'
+       ',LPS,Q,LP,RP')                       :(FASTBAL_END)
*  Entry point: NAME is a uniquely created name for the
*  variable that will eventually hold the returned pattern.
*  IBAL is a pattern to match balanced strings on the
*  interior of brackets.
FASTBAL  NAME  =  'FASTBAL_' &STCOUNT
      IBAL  =  CONVERT(NAME,'EXPRESSION')
      IBAL  =  DIFFER(S,NULL)  FASTBAL(PARENS,QTS)
*  SPCHARS are all the special characters. ELEM is a monic
*  pattern to match a balanced string to be built up during
*  the subsequent computation.
      SPCHARS  =  PARENS QTS S
      ELEM  =  NOTANY(PARENS QTS) BREAK(SPCHARS)
*  Loop on quotes, oring in a quoted literal pattern for
*  every quote.
FASTBAL_1  QTS  LEN(1) . Q  =                 :F(FASTBAL_2)
      ELEM  =  Q BREAK(Q) Q | ELEM            :(FASTBAL_1)
*  Loop on parens, oring in a balanced form for each pair.
FASTBAL_2  PARENS  LEN(1) . LP  RTAB(1) . PARENS
+       LEN(1) . RP                           :F(FASTBAL_3)
      ELEM  =  LP IBAL RP | ELEM              :(FASTBAL_2)
*  Wrap things up and return.
FASTBAL_3  FASTBAL  =  BREAK(SPCHARS) ARBNO(ELEM)
      $NAME  =  FASTBAL                       :(RETURN)
FASTBAL_END


Program 8.5  NOT

The function NOT(P) returns a pattern which will match the
null string provided P would fail and will fail if P would
succeed. NOT(P) is undefined if P is nonlinear. As an example
of the use of NOT assume we wish to write a pattern which will
match a PL/I comment. The pattern '/*' ARB '*/' will not do
since it will match other things in addition to comments. For
example it will match three strings in the PL/I statement
below where only two are comments.


GOUT /* GARBAGE OUT */ = GIN /* GARBAGE IN */ 

To match a comment we can write:

      '/*' ARBNO(NOT('*/') LEN(1)) '*/'


Here the ARB is replaced by a pattern constructed from ARBNO
which will match an arbitrary string not containing the sub-
string '*/'. To speed up the search for the closing '*/' we
can employ BREAK as follows:


      '/*' ARBNO(NOT('*/') LEN(1) BREAK('*')) '*/'
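
The difference between the two comment patterns can be checked
with a small Python sketch. This is an analogy only: the
regular-expression lookahead (?!...) is used here as a stand-in
for NOT, and the list comprehension enumerates by brute force
every substring that '/*' ARB '*/' could match. The names
STATEMENT, ARB_STYLE and COMMENT are hypothetical.

```python
import re

STATEMENT = "GOUT /* GARBAGE OUT */ = GIN /* GARBAGE IN */"

# '/*' ARB '*/' tried at every starting position and every extension.
ARB_STYLE = [
    STATEMENT[i:j]
    for i in range(len(STATEMENT))
    for j in range(i + 4, len(STATEMENT) + 1)
    if STATEMENT.startswith("/*", i) and STATEMENT.endswith("*/", 0, j)
]

# '/*' ARBNO(NOT('*/') LEN(1)) '*/': the lookahead plays the part of NOT.
COMMENT = re.compile(r"/\*(?:(?!\*/).)*\*/", re.DOTALL)
comments = COMMENT.findall(STATEMENT)

print(len(ARB_STYLE), comments)
```

The brute-force list holds three strings, one of which spans
both comments, while the lookahead version yields only the two
genuine comments.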


The function NOT is so constructed as to be embeddable in it- 
self. Thus NOT(NOT(P)) will match the null string if P would 
succeed. Also if C were the comment matcher defined above, 
NOT(C) would operate correctly. 


One drawback of NOT, which is the reason we will not use it 
more widely in building other patterns, is that it must be 
used in FULLSCAN mode. The reason for this is the one- 
character assumption of the recursive reduction heuristic 
described in the previous chapter. Since mode switching is 
generally poor programming practice, we will generally avoid 
the use of NOT. 


*  NOT(P) will return a pattern which will match the null
*  string if P fails and fail if P matches. If P aborts,
*  NOT(P) will also abort.
      DEFINE('NOT(P)')                        :(NOT_END)
*  Entry point: Return a pattern which pushes null onto the
*  stack and replaces it with nonnull only if the pattern
*  succeeds. The flag is eventually popped and tested by the
*  alternative.
NOT   NOT  =  *PUSH() P *(POP() PUSH(1)) FAIL |
+       *IDENT(POP())                         :(RETURN)
NOT_END

Names referenced    Name    Type        Where defined
by NOT:             PUSH    Function    Program 5.5
                    POP     Function    Program 5.6

Epilogue


P is assumed not to have side effects which will alter the 
stack. For example, if 


P = NULL | *(POP() PUSH()) FAIL 


then P will cleverly undo what NOT was trying to do and cause 
NOT(P) to succeed where it should always fail. But this 
amounts to almost deliberate meddling. If P uses the stack
normally (i.e. leaving its state the way it was found) then 
NOT will operate correctly. 


Program 8.6  ONCE

ONCE() returns a pattern which will succeed once and only once
and thereafter fail forever. For example the pattern matching
statement

      'AAAB'  'A' ONCE() 'B' | 'B'


will result in the 'B' being matched, but not the 'AB', since 
the first time through the left alternation, 'B' failed, in- 
dicating that that path could no longer be taken. Note that 
ONCE() must return a new and distinct pattern on each call
since once it is used it can never be reused. 


ONCE () is similar to FENCE in that it matches the null string 
initially. Unlike FENCE, however, failure in subsequent tries 
is like FAIL (as opposed to ABORT) which permits other alter- 
nates to be taken. 
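
The behavior just described can be imitated with a closure. The
Python sketch below is an illustrative model and not the book's
implementation: once() returns a fresh gate that succeeds on
its first use only, and the hypothetical function scan walks
through the example statement by hand, trying the left
alternative 'A' ONCE() 'B' and then the right alternative 'B'
at each cursor position.

```python
def once():
    """Return a zero-argument matcher that succeeds on its first call only."""
    used = False

    def matcher():
        nonlocal used
        if used:
            return False      # behaves like FAIL from now on
        used = True
        return True           # matches the null string this one time

    return matcher


def scan(subject):
    """Model of:  subject  'A' ONCE() 'B' | 'B'   (unanchored)."""
    gate = once()
    for i in range(len(subject)):
        if subject.startswith("A", i) and gate() and subject.startswith("B", i + 1):
            return subject[i:i + 2]          # left alternative
        if subject.startswith("B", i):
            return subject[i:i + 1]          # right alternative
    return None
```

Calling scan("AAAB") returns "B": the gate is spent at the
first 'A', so the 'AB' at the end is never matched by the left
alternative.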


*  ONCE() will return a pattern that will succeed just once.
      DEFINE('ONCE(ID)NAME')                  :(ONCE_END)
*  Entry point: If the argument is null we return a new
*  pattern equal to *ONCE(id) where id is a unique integer.
ONCE  ONCE  =  IDENT(ID,NULL)
+       CONVERT('ONCE(' &STCOUNT ')','EXPRESSION')  :S(RETURN)
*  Otherwise compute a name based on the unique ID. Return
*  its value. It will be initially null. Set it to FAIL for
*  all subsequent calls.
      NAME  =  'ONCE..' ID
      ONCE  =  $NAME
      $NAME  =  FAIL                          :(RETURN)
ONCE_END
Epilogue 


The function ONCE() returns an expression of the form *ONCE(n)
which will succeed just once and fail forever after. It il- 
lustrates several principles. First, a function can return 
different patterns and each of these patterns can vary their 
own behavior with time. Second, the function serves both to 
return a pattern initially and is also the function invoked 
during the match. Both of these operating principles will be 
in use in the next function. 


The technique used to encode ONCE() can be used to pick off 
the first match of a pattern and thereby increase efficiency. 
See Exercise 8.8. 


Program 8.7  TEST

TEST is designed to alleviate some of the problems involved
with the one-character assumption which we have already
indicated might be a source of difficulty with the NOT
function. TEST will accept an unevaluated expression as
argument and return a pattern. When the pattern is encountered
by the scanner during a pattern match the original unevaluated
expression will be EVALed and the pattern will succeed or fail
depending on the outcome of the EVAL. If it succeeds it
matches the null string. For example


      TEST(*LGT(A,B))


will return a pattern which, during pattern matching, will 
succeed or fail depending on whether A is, or is not, lex- 
ically greater than B. 


Thus TEST(exp) acts like exp. It differs from exp in that its 
minimum length will be 0 as opposed to 1 and it will match the 
null string if the evaluation succeeds. 


      DEFINE('TEST(ARG)NAME')                 :(TEST_END)
*  Entry point: If ARG is an EXPRESSION we will return a
*  pattern. The expression is saved in a unique name (NAME)
*  and this name, in the form of a string, is used as an
*  argument on subsequent calls to TEST.
TEST  IDENT(DATATYPE(ARG),'EXPRESSION')       :F(TEST_1)
      NAME  =  'TEST_' &STCOUNT
      $NAME  =  ARG
      TEST  =  EVAL("NULL $ *TEST('" NAME "')")   :(RETURN)
*  If ARG is not an EXPRESSION we presume that we are dealing
*  with one of those subsequent calls to TEST. In fact, we
*  can conclude that we're in the middle of a pattern match.
*  Retrieve the old expression and evaluate it and return a
*  dummy name.
TEST_1  TEST  =  ?EVAL($ARG) .TEST_           :S(NRETURN)F(FRETURN)
TEST_END

Program 8.8  LIKE

LIKE(S) returns a pattern that will match a string like the
one passed as argument. A like string is defined as any one
differing from the argument by a) a rearrangement of two
adjacent characters, b) the deletion of a character or c) the
insertion of a character.


      DEFINE('LIKE(S)C,T1,T2,N')              :(LIKE_END)
*  Entry point: Make sure that S itself is regarded as LIKE
*  S.
LIKE  LIKE  =  S
*  Loop on N where N denotes a cursor position within S.
*  Split S into two parts, T1 and T2.
LIKE_1  S  TAB(N) . T1  REM . T2              :F(RETURN)
      N  =  N + 1
*  First OR in a pattern which matches S with one character
*  inserted at position N.
      LIKE  =  LIKE | T1 LEN(1) T2
*  Then OR in the pattern which matches with one character
*  deleted at position N.
      T2  LEN(1) . C  =                       :F(RETURN)
      LIKE  =  LIKE | T1 T2
*  Then OR in the pattern where the two characters at
*  position N have been rearranged.
      T2  POS(1)  =  C                        :F(LIKE_1)
      LIKE  =  LIKE | T1 T2                   :(LIKE_1)
LIKE_END
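
The loop in LIKE can be traced in another notation. The Python
sketch below is an illustration under stated assumptions: it
builds a regular expression instead of a SNOBOL4 pattern, "."
stands in for LEN(1), and the name like is hypothetical. The
alternatives are generated in the same order as above:
insertion, deletion, then rearrangement at each position.

```python
import re


def like(s):
    """Return a compiled regex accepting s and its one-error neighbours."""
    alts = [re.escape(s)]
    n = 0
    while n <= len(s):
        t1, t2 = s[:n], s[n:]
        alts.append(re.escape(t1) + "." + re.escape(t2))    # insertion at n
        if not t2:
            break                                           # nothing left to delete
        c, t2 = t2[0], t2[1:]
        alts.append(re.escape(t1) + re.escape(t2))          # deletion at n
        if t2:
            t2 = t2[0] + c + t2[1:]
            alts.append(re.escape(t1) + re.escape(t2))      # rearrangement at n
        n += 1
    return re.compile("(?:" + "|".join(alts) + ")", re.DOTALL)
```

For example, like("CAT").fullmatch accepts CART, CT and ACT but
rejects COT, since the substitution of a character is not one
of the three permitted differences.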
Program 8.9  OR

OR(S) is intended to form the OR (in the pattern sense) of
several strings contained in S. For example
OR(',ABC,DEF,XYZ') is equivalent to

      'ABC' | 'DEF' | 'XYZ'

The initial character (in this case a comma) is used to
separate elements. For efficiency purposes, OR will factor out
like initial characters. Thus

      OR(',ABLE,ACTOR,ANCHOR,BAKER,BULL')

is equivalent to

      'A' ('BLE' | 'CTOR' | 'NCHOR') | 'B' ('AKER' | 'ULL')

The resulting expression in this example is over twice as fast
as alternating 5 strings since for most subjects only 2 checks
are needed for every pre-cursor position as opposed to 5. The
initial character extraction is done to arbitrary levels so
that

      OR(',ABC,ABBOT,ACTOR,BAKER')

will return

      'A' ('B' ('C' | 'BOT') | 'CTOR') | 'BAKER'

For efficiency purposes, if a factored character contains only
one branch, the character is combined with the head of the
branch. Thus

      OR(',ABC,ABBOT,BAKER')

returns

      'AB' ('C' | 'BOT') | 'BAKER'

Characters in parenthesis imply an ANY-like construction. Thus

      OR(',C(AO)D,C(AO)ST')

will return

      'C' ANY('AO') ('D' | 'ST')


Several examples of the use of OR are given in the initializa- 
tion section of HYPHENATE (Program 10.7). 
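
The factoring itself is easy to experiment with. The Python
sketch below is a simplified model, not the OR function: it
takes a list of words rather than a delimited string, renders
the result as text instead of building a pattern, writes NULL
for an empty branch, and omits the parenthesized ANY
construction. The names factor, common_prefix and literal are
hypothetical.

```python
def common_prefix(group):
    """Longest prefix shared by every word in group."""
    prefix = group[0]
    for w in group[1:]:
        n = 0
        while n < min(len(prefix), len(w)) and prefix[n] == w[n]:
            n += 1
        prefix = prefix[:n]
    return prefix


def literal(w):
    return "'" + w + "'" if w else "NULL"


def factor(words):
    """Render the alternation of words with common leading characters factored out."""
    words = list(dict.fromkeys(words))            # drop duplicates, keep order
    groups = {}                                   # first character -> words
    for w in words:
        groups.setdefault(w[:1], []).append(w)
    parts = []
    for group in groups.values():
        if len(group) == 1:
            parts.append(literal(group[0]))
            continue
        common = common_prefix(group)             # combines single-branch characters
        rest = [w[len(common):] for w in group]
        parts.append(literal(common) + " (" + factor(rest) + ")")
    return " | ".join(parts)
```

Applied to the three word lists above, this sketch reproduces
the factored forms shown in the text.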


*  OR(LIST) will return the alternation of the substrings of
*  LIST separated by the break character determined by the
*  first character in LIST. Parenthesized strings are
*  regarded as ANY.
      DEFINE('OR(LIST)BC,SEIZE,ANC')
*  OR_EXTRACT() is a function used by OR to extract from the
*  global variable LIST, the substrings beginning with the
*  same first character (or parenthesized expression).
      DEFINE('OR_EXTRACT()COMMON,IC,P,SUBLIST,T,TLIST,C1,C2')
+                                             :(OR_END)
*  Entry point for OR. Determine the break character and
*  define a pattern to be used throughout to SEIZE all up to
*  the next break character. Define ANC as a pattern to
*  anchor the scan and match the Break Character.
OR    LIST  LEN(1) . BC                       :F(FRETURN)
      SEIZE  =  BREAK(BC) | REM
      ANC  =  POS(0) BC
*  Or together all extractions.
      OR  =  OR_EXTRACT()
OR_LOOP  OR  =  OR | OR_EXTRACT()             :S(OR_LOOP)F(RETURN)
*  Entry point for OR_EXTRACT(): Set TLIST to be a copy of
*  LIST. Extract initial character (IC) and set COMMON equal
*  to the first substring. If this pattern fails, no IC could
*  be found. This means that LIST is either empty in which
*  case we fail, or contains only BC in which case we return
*  the null string. Both of these cases are important since
*  the former terminates the loop in OR() and the latter
*  breaks the recursion of OR_EXTRACT().
OR_EXTRACT
      TLIST  =  LIST
      LIST  ANC (BAL . IC  SEIZE) . COMMON    :S(ORX_1)
      IDENT(LIST,NULL)                        :S(FRETURN)
      LIST  =  NULL                           :(RETURN)
*  Find the largest COMMON prefix contained in all strings
*  beginning with IC.
ORX_1 TLIST  ANC IC                           :F(ORX_3)
ORX_2 TLIST  ANC COMMON SEIZE  =              :S(ORX_1)
*  COMMON was not there. Reduce COMMON by one character and
*  try again. This means extract the last balanced string of
*  COMMON.
      BALREV(COMMON)  BAL REM . COMMON        :F(ERROR)
      COMMON  =  BALREV(COMMON)               :(ORX_2)
*  Now remove the COMMON characters from each string as we
*  prepare a SUBLIST to be OR'ed.
ORX_3 LIST  ANC COMMON SEIZE . T  =           :F(ORX_4)
      SUBLIST  =  SUBLIST BC T                :(ORX_3)
*  Convert any parenthesized expression in COMMON to an ANY.
*  Build up the pattern in a temporary P. Then join this with
*  the result of a recursive call to OR.
ORX_4 COMMON  BREAK('(') . C1  '('  BREAK(')') . C2
+       ')'  =                                :F(ORX_5)
      P  =  P C1 ANY(C2)                      :(ORX_4)
ORX_5 OR_EXTRACT  =  P COMMON OR(SUBLIST)     :(RETURN)
OR_END

Names referenced    Name      Type        Where defined
by OR:              BALREV    Function    Program 3.8

Program 8.10

This pattern is intended to match a PL/I statement (assigning
to STMT the string matched) and to fail if none exists. The
presumed scenario is that a program is reading lines of a PL/I
program and continues to apply the pattern until it succeeds
in matching a prefix of the combined input lines. The pattern
need not check for syntactic correctness of the input and
hence it will be sufficient to check for the presence of a
semicolon provided this character does not appear within
quotes or comments.


*  Define an ELEM as a quoted literal or a comment or a
*  nonnull sequence containing neither a semicolon nor a
*  comment or quote delimiter.
      Q  =  "'"
      QLIT  =  Q FENCE BREAK(Q) Q
      CMNT  =  '/*' FENCE ARB '*/'
      ELEM  =  QLIT | CMNT | LEN(1) BREAK('/;' Q)
*  Use back-up-free scanning (Chapter 6) to search for the
*  statement.
      PLI.STMT  =  POS(0) (ARBNO(ELEM FENCE) ';') . STMT
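
The effect of the pattern can be restated as an ordinary
left-to-right scan. The Python sketch below is a hedged
restatement rather than a translation: the function name
pli_statement is hypothetical, and the FENCEs are modelled by
giving up as soon as a quoted literal or comment is found to be
unclosed.

```python
def pli_statement(text):
    """Return the first complete statement at the start of text, or None."""
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == ";":
            return text[: i + 1]
        if ch == "'":                          # quoted literal: skip to closing quote
            close = text.find("'", i + 1)
            if close < 0:
                return None                    # literal still open: more input needed
            i = close + 1
        elif text.startswith("/*", i):         # comment: skip to the closing */
            close = text.find("*/", i + 2)
            if close < 0:
                return None                    # comment still open
            i = close + 2
        else:
            i += 1                             # any other character
    return None
```

A caller would keep appending input lines to text until the
function returns a statement, as in the scenario described
above.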


Program 8.11  ASM360

Many problems involving the processing of assembler source can
be conceptually simple and yet provide a challenge to the
programmer. Consider the problem of reformatting the source so
that various syntactic parts such as operations, operands and
comments are set to align at pre-determined card
columns. The heart of this problem as well as many others is 
simply the extraction of the various fields since once these 
have been obtained it is a relatively simple matter to recast 
a given line in a new format. Different assembler languages 
offer different problems to be solved. The OS assembler 
[IBM360b] is noted for its relative ubiquity and complexity
and will offer a fine example to consider. 


In the OS assembler there are four fields separated by blanks, 
viz. 


NAME OPERATION OPERAND COMMENT 


where the optional NAME field must begin in column 1 if it 
exists. One is tempted to use BREAK(' ') to separate the 
fields. This works for the first two fields but the operand 
field may have blanks embedded in quoted literals and so this 
simple scheme will not do. Moreover, the quote that appears 
in an expression beginning with L' is not to be considered for 
quote-balancing. Thus 


      L        MVI   3,L'ABC     'THIS IS A COMMENT'


has an operand field (3rd field) that breaks after ABC and not 
after THIS. The rule for determining whether L' is to be 
considered specially is given on p. 71 of [IBM360b]:


"An apostrophe not within a quoted string 
immediately followed by a letter and immediately 
preceded by the letter L (where L is preceded by 
any special character other than an ampersand) is 
not considered in determining paired apostrophes." 


On page 10 of [IBM360b] we obtain the definitions of 'letter'
and 'special character' and so we begin coding ...


      LETTER  =  'ABCDEFGHIJKLMNOPQRSTUVWXYZ$#@'
      SP.CH   =  "+-=,*()'/& "
*  From this we obtain 'special character other than
*  ampersand' which we will call SCOTA.
      SCOTA  =  SP.CH
      SCOTA  '&'  =


*  We consider the line decomposed into disjoint elements
*  where each element is either (in order) a quoted literal,
*  an L' construct, a single SCOTA or a sequence of
*  non-SCOTA's.
      Q  =  "'"
      QLIT  =  Q FENCE BREAK(Q) Q
      ELEM  =  QLIT | 'L' Q | ANY(SCOTA) | BREAK(SCOTA) | REM
*  From this we may use back-up-free scanning to define the
*  operand field (F3). B is used to separate fields. The
*  first two fields according to p. 8 of [IBM360b] are
*  terminated by blanks (or the end of the line).
      F3  =  ARBNO(ELEM FENCE)
      B   =  (SPAN(' ') | RPOS(0)) FENCE
      F1  =  BREAK(' ') | REM
      F2  =  F1
*  To further complicate the issue, if the operation is one
*  of a class of conditional assembly operations defined on
*  p. 75 of [IBM360b] as:
      CAOP  =  ('LCL' | 'SET') ANY('ABC') |
+       'AIF' | 'AGO' | 'ACTR' | 'ANOP'
*  then the operand is a conditional assembly operand. For
*  such operands the number of ways of using the quote
*  character in unbalanced situations is increased. For
*  example T'NAME refers to the type attribute of the symbol
*  NAME and the quote here is not to be considered as one of
*  a pair of balanced quotes. The set of attributes is given
*  by the pattern ATTR.
      ATTR  =  ANY('TLSIKN')
*  Moreover, the operations SETB and AIF permit 'logical
*  expressions enclosed in parenthesis'. Logical expressions
*  may contain blanks so we must ignore any blanks contained
*  within paired parenthesis. Of course we must ignore any
*  parens within quotes and we must continue to ignore quotes
*  which occur merely as part of an attribute. Since it
*  cannot hurt to ignore blanks within parens in any of the
*  conditional assembly operations we can treat all of them
*  uniformly. ELEMC is an expanded form of ELEM permitting
*  the additional attributes and the parenthetical groupings.
*  F3C will match an operand field (field 3) if the operation
*  is a conditional assembly.
      ELEMC  =  '(' FENCE *F3C ')' | ATTR Q | ELEM
      F3C  =  ARBNO(ELEMC FENCE)
*  Putting it all together:
      ASM360  =  F1 . NAME  B
+       ( CAOP . OPERATION  B  F3C . OPERAND |
+         F2 . OPERATION  B  F3 . OPERAND )
+       B  REM . COMMENT

EXERCISES

Exercise 8.1

Assuming S is nonnull, rewrite BRKREM(S) as a single
expression involving only (but not necessarily all of) LEN,
POS, RPOS, SPAN, BREAK, ANY, NOTANY and ARBNO.


Exercise 8.2

Write a version of SPAN(S) (call it SPANULL) which will match
the null string in the case that SPAN(S) would fail.
Otherwise, SPANULL(S) should behave exactly like SPAN(S). Thus
SPANULL(S) must be monic. This can be done in several ways.
Try it a) using NOT(P), b) using BRKREM(S) and c) from
scratch.


Exercise 8.3

Modify BREAKX (call it BRKXREM) so that it will match the
remainder of the subject string as its last extension. Thus

      'A,B,C'  POS(0) BRKXREM(',') $ OUTPUT FAIL

will print 'A', 'A,B' and 'A,B,C'.


Exercise 8.4

Which of the following assignments would also be valid ways of
implementing BREAKX(S)? That is, which of the statements
below, if substituted for the one statement in Prog. 8.2, will
produce a correct rendition of BREAKX?

      BREAKX  =  ARBNO(BREAK(S) LEN(1)) BREAK(S)
      BREAKX  =  BREAK(S) (NULL | LEN(1) *BREAKX)
      BREAKX  =  ARBNO(LEN(1) BREAK(S)) BREAK(S)
      BREAKX  =  BREAK(S) (NULL | LEN(1) BREAKX(S))

Exercise 8.5

Given the subject, "AB(C,D')E')GH", which values of pre-cursor
position will the pattern

      BAL('()', "'") ANY(',)')

match?


Exercise 8.6

Let RULE be string-valued and contain the rule of some SNOBOL4
statement (i.e. the statement without the label and goto
fields). Assume the rule is trimmed of leading and trailing
blanks. Write code to determine the type of SNOBOL4 statement
and branch to one of the following labels: PM for pattern
match, PMR for pattern match with replacement, ASGN for
assignment and EXP for none of the above (Hint: Using the BAL
function, this will require one pattern assignment and three
pattern matches).


Exercise 8.7

The author once committed an error similar to the following.
Assume that to create a truly unusual name the first statement
of FASTBAL (Prog. 8.4) is changed to:

FASTBAL  NAME  =  'FASTBAL ' &STCOUNT

Surely, vanishingly few identifiers contain blanks and the
&STCOUNT makes it that much more unusual. Why is this an
error?


Exercise 8.8

Write a function FIRST(P) which will return a monic pattern
whose post-cursor position is the first post-cursor position
yielded by the pattern P. Note that unlike ONCE(), FIRST(P)
should be reset at each cursor position.

Exercise 8.9

What is *ONCE() equivalent to?

Exercise 8.10

Write a function NTIMES(N) which will return a pattern which
will match the null string exactly N times and thereafter fail
forever.

Exercise 8.11

Write a function IF(P) which will match the null string if P
would succeed and will fail if P would fail. (Hint: you may
use functions defined in this chapter).


Exercise 8.12

Let the SIZE of a string S be L. How many alternates will
LIKE(S) have (Prog. 8.8)? Modify LIKE so that it uses OR
(Note: ANY(&ALPHABET) can be used in place of LEN(1)). How
many principal alternates will LIKE then have (assume that S
contains at least 3 characters and that the first two
characters are different)? What is the fewest number of
principal alternates that LIKE could have? Rewrite LIKE to
obtain that many.


Exercise 8.13

Modify LIKE(S) (Program 8.8) so that, in addition to
insertions, deletions and rearrangements, any string differing
from S in a single character will be matched.

Exercise 8.14

LIKE will tolerate just one error. Rewrite LIKE so that it
will tolerate K errors (Hint: Rewrite LIKE recursively).

Exercise 8.15

What character(s) could not be used as a break character for
OR?


Exercise 8.16

To allow for really rapid scanning for a set of strings,
modify OR(S) so that it returns

      BREAKX(S1) OLD_OR(S)

where OLD_OR is the OR function defined in Prog. 8.9 and where
S1 is derived from the argument S.

Exercise 8.17

Rewrite PLI.STMT so that it does not use FENCE but NOT
instead.

Exercise 8.18

Find a subject for which PLI.STMT will behave incorrectly if
any of the following changes are made.

(a) removing the FENCE from QLIT

(b) removing the FENCE from CMNT

(c) removing the FENCE in the argument to ARBNO.


Exercise 8.19

A telephone information service operates by the user dialing
(or touch-toning) a party's name using the letters that appear
on the dial. This does not uniquely specify a string of
letters since each digit has a group of 3 characters
associated with it as follows:

      ABC - 2        PRS - 7
      DEF - 3        TUV - 8
      GHI - 4        WXY - 9
      JKL - 5        Z   - 0
      MNO - 6


Write a function called NAME which accepts as argument a 
string of digits and will return a pattern which can be 
matched against all names in a directory. The pattern should 
be of the form ANY() ANY() ... ANY() where there are as many 
ANY's as there are characters in the string. (Hint: the body 
of the function requires only 3 relatively simple statements.) 

Exercise 8.20

Assuming that LEN(N) can have negative arguments we could make
a rapid search for the least likely character of a string
using BREAKX. For example, to scan for 'EXAMPLE' in a string
of text, it would in general be more efficient to use the
pattern

      BREAKX('X') LEN(-1) 'EXAMPLE'

than a BREAKX('E') construction because of the low frequency
of the letter 'X' in English text compared with 'E'. Write a
function called SEARCH(S) which will return an optimal pattern
in the above form for searching for the string S. Assume that
S contains only alphabetics and that the letter frequency is
that of English, viz.

      FREQ.TBL  =  'ETOANIRSHDLCWUMFYGPBVKXQJZ'


(Interesting note: The least-frequent character can be deter- 
mined in one statement by a simple scan.) 



One of SNOBOL4's many assets is the simplicity and directness
of its I/O. One need merely mention the variable INPUT in an
expression and, automatically, a card (or card image) is read
and the string of characters on the card is used as the value
of the variable INPUT. Similarly, the mere assignment of a
value to the variable OUTPUT or PUNCH will cause that value to
be respectively printed or punched.


In many cases, however, we want something slightly richer than 
this, as the following programs will illustrate. 


SSS 
{{ Program {f{ For many applications the basic input 
11 9.1 | process is less than completely ideal. We 
tt READ | often would like to read in a card, compare 
t___--—--_______--} it against a pattern, and, if the card was 


not what we sought, transfer to another section of the program 
which will read the same card from the input stream. Our aim 
could be realized if we had the ability to put something back 
on the input stream. This act is impossible in SNOBOL4 but it 
could be effectively done by writing a subroutine which could 
store things we ‘pushed! onto the input stream and yield them 
up when we sought to read. This we will not do (but leave as 
an exercise). We will create something which will be less 
general but simpler and, in most situations, easier to use. 
We will define a function called READ which will accept one 
argument, viz. a pattern, which will be matched against the 
next string on the input stream. If the pattern matches this 
string, the string will be returned. If the pattern fails to
match, the READ function will fail but will save the string 
for the next time READ is called. In the several programs 
following this one, we will show how this property can be 
used. 
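
The behaviour just described (return the card on a match, otherwise
fail and keep the card for the next call) can be sketched in Python.
This is an illustration only: the class CardReader is hypothetical,
a regular expression stands in for the SNOBOL4 pattern, and None
stands in for SNOBOL4 failure.

```python
import re


class CardReader:
    """One-card lookahead over an iterable of card images."""

    def __init__(self, cards):
        self._cards = iter(cards)
        self._buf = None                  # plays the role of INPUT_BUF

    def read(self, pattern=""):
        """Return the next card if pattern matches it, else None.

        A card that fails the match stays buffered for the next call.
        """
        if self._buf is None:
            self._buf = next(self._cards, None)
            if self._buf is None:         # nothing left on the input
                return None
        if re.search(pattern, self._buf) is None:
            return None                   # leave the card in the buffer
        card, self._buf = self._buf, None
        return card
```

An empty pattern always matches, so read() with no argument takes
whatever card comes next.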


Another inadequacy with the basic input facility of SNOBOL4 
has to do with file sequencing on the IBM 360/370. When no 
more input remains on the current input file, and an input re- 
quest is made (by a reference to the variable INPUT) the 
reference will FAIL (in the SNOBOL4 sense of statement 
failure). If an input request is made after the initial 
failure, the next file in sequence will be opened. If this 
file is not present, the program terminates abnormally. 


Unfortunately, this is not what we want most of the time. 
Often, the reason several files have been placed in sequence 
is to make them appear to the program as one long file, an ap- 
pearance which is blemished if failures occur in between. Also 
we would like the liberty of making several read requests 
after the final failure without fear of blowing the program. 


READ will take care of this file sequencing problem. It will 
fail only after the last file has been exhausted and subse- 
quent calls thereafter will merely fail. 
*  READ(P) will read in and return a card provided it is
*  matched by the pattern P. If there are no cards remaining or
*  if the pattern fails READ will fail.
*
          DEFINE('READ(P)')                       :(READ_END)
*
*  Check to see if the number of files beyond the current is
*  negative. If so return failure.
*
READ      LT(NF_INPUT,0)                          :S(FRETURN)
*
*  Fill the input buffer if it is empty.
*
          IDENT(INPUT_BUF,NULL)                   :F(READ_1)
          INPUT_BUF = INPUT                       :F(READ_2)
READ_1
*
*  Check the buffer for a successful match against P. If no
*  match, then fail return. If match, then return the value
*  in the buffer (INPUT_BUF) and clear the buffer.
*
          INPUT_BUF P                             :F(FRETURN)
          READ = INPUT_BUF
          INPUT_BUF = NULL                        :(RETURN)
*
*  If the attempt to read resulted in failure, then control
*  passes to READ_2. Deduct 1 from the number of remaining
*  files and transfer to label READ. If this number becomes
*  negative, the function will fail continually.
*
READ_2    NF_INPUT = NF_INPUT - 1                 :(READ)
READ_END


Epilogue 


The variable NF_INPUT (Number of Files on INPUT) is to be set 
equal to the number of files beyond the current one. Normally 
NF_INPUT is equal to 0 since the default value of variables is 
null (which numerically equals 0). Therefore, the programmer 
normally need not worry about its value. However, he may set 
this at any time during the running of the program if additional
files remain. For example if a special marker is
placed at the end of a file to indicate that this was not the
last one in a sequence then the appearance of that marker
could be used to trigger an assignment of the value 1 to the
variable NF_INPUT. 
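
The file-counting logic can be modelled in a few lines of Python. The
sketch below is an assumption-laden toy, not the book's code: the
class SequencedInput imitates the IBM 360/370 behaviour described
above (a failure at the end of each file, with the next request
opening the next file), and read_all plays the part of a loop of
READ() calls governed by NF_INPUT.

```python
class SequencedInput:
    """Toy model of INPUT over several files read in sequence."""

    def __init__(self, files):
        self._files = [list(f) for f in files]
        self._pos = 0

    def get(self):
        """Return a card, or None once at the end of each file."""
        if not self._files:
            # No further file to open: the abnormal termination case.
            raise RuntimeError("no file left to open")
        current = self._files[0]
        if self._pos < len(current):
            self._pos += 1
            return current[self._pos - 1]
        self._files.pop(0)        # fail now; the next get() opens the next file
        self._pos = 0
        return None


def read_all(source, nf_input):
    """Collect every card, hiding the failures between files."""
    cards = []
    while nf_input >= 0:          # a negative count means: fail from now on
        card = source.get()
        if card is None:
            nf_input -= 1         # one fewer file beyond the current one
        else:
            cards.append(card)
    return cards
```

With two files and a count of 1, the three cards arrive as one
uninterrupted stream and no request is made beyond the last file.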


Program 9.2  FORTREAD


Many string-processing problems involve the analysis of the
source language of some other program. FORTRAN is perhaps
typical of the kind of language which we might wish
to process. Examples include compilation (translation of
FORTRAN programs for semantic errors not discoverable by the
compiler), flow charting (describing diagrammatically the flow
of control), preprocessing (translation of an extension of
FORTRAN into FORTRAN such as SIMSCRIPT [Dimsdale & Markowitz,
1964]), and conversion (translating a version of FORTRAN for
one machine to a version suitable for another). In addition
to these fairly complex undertakings, the processing could be
some simple house-keeping chore such as converting every
reference to 'ALPHA' to a reference to 'BETA'.


When writing programs to analyze other programs it is usually 
wise to write a function whose only duty is to collect and 
return the next statement on the input stream and FAIL if no 
statement remains. The benefits of doing this are the same as
those derived from subroutinizing one's program generally. It
saves duplication of code, allows subdivision of labor, makes
the program logic easier to follow and makes the program
easier to modify and maintain.


A card with a 'C' in column 1 is regarded as a comment card by
the FORTRAN compiler. Comments may appear anywhere, even
between a statement and its continuation. These are ignored. A
continuation card is indicated by a nonblank in column 6. A
blank in column 6 indicates the start of a new statement.


*  FORTREAD will read in and return the next FORTRAN statement
*  on the input stream.
*
          DEFINE('FORTREAD()T')
          INPUT(.INPUT,5,72)
          FORT_COMMENT = POS(0) 'C'
          FORT_CONTINUE = POS(0) LEN(5) NOTANY(' ') REM . T
                                                  :(FORTREAD_END)
*
*  First pass over any initial comment cards and then read in
*  the first statement.
*
FORTREAD  READ(FORT_COMMENT)                      :S(FORTREAD)
          FORTREAD = READ()                       :F(FRETURN)
*
*  Then pass over more comments (if any) and then look for a
*  continue card. If not found we return. But if found, the
*  variable T will hold the desired value. This is tacked
*  onto FORTREAD and we renew the search for a continue.
*
FORTREAD_1 READ(FORT_COMMENT)                     :S(FORTREAD_1)
          READ(FORT_CONTINUE)                     :F(RETURN)
          FORTREAD = FORTREAD T                   :(FORTREAD_1)
FORTREAD_END


Names referenced    Name      Type      Where defined
by FORTREAD:        READ      Function  Program 9.1


Epilogue


The initialization section of FORTREAD reassociates the
variable INPUT with the first 72 characters of a card. In this
way the identification field of the FORTRAN deck (columns 73
through 80) is ignored.


Two patterns are also set in this initialization section. The 
first pattern matches successfully any FORTRAN comment card; 
the second will not only match successfully a FORTRAN continue 
but will assign the 'meat' of any continue card to the tem- 
porary variable T. 


One may note the rather heavy use to which READ has been put. 
It is called at four separate places and has greatly sim- 
plified the writing of FORTREAD. The first call represents a 
rather conventional use of READ. "Give me the next card if it
is a comment." It is in fact thrown away immediately. The
second call of READ, which is made with no argument, makes use 
of the fact that a null string will be supplied by default. 
Since a null string as a pattern will always match, READ() is, 
in effect, an unconditional grab at the next string on the
input stream. It can only fail if there is nothing left.


Another use of READ is in the fourth call in the third last 
line of the program. This call not only tests the next string 
but causes a variable (T) to be assigned a subpart of the 
string. Patterns, in general, can denote arbitrarily complex 
computations with the subject string as effective argument. 
This property of patterns imparts to READ a high degree of 
flexibility. 
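

The card conventions handled by FORTREAD can be restated as a short
Python sketch. It is an illustration under stated assumptions rather
than a translation: the generator fortran_statements is a
hypothetical name, and it walks a list of card images directly
instead of calling a READ function.

```python
def fortran_statements(cards):
    """Yield FORTRAN statements assembled from card images."""
    statement = None
    for card in cards:
        card = card[:72]                  # drop the identification field
        if card.startswith("C"):          # comment card: ignore it
            continue
        is_continuation = len(card) > 5 and card[5] != " "
        if statement is not None and is_continuation:
            statement += card[6:]         # tack on the 'meat' of the card
            continue
        if statement is not None:
            yield statement               # blank column 6: previous one is done
        statement = card
    if statement is not None:
        yield statement
```

A comment lying between a statement and its continuation is skipped
without ending the statement, as in the program above.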


Program 9.3  PARAGRAPH


For many of the same reasons that we might want a FORTRAN
statement grabber if we were processing FORTRAN decks, we might
want a paragraph grabber if we are processing text. A
paragraph, here, is assumed to be a sequence of
lines down to the next paragraph whose start is designated by
a blank in column 1. Since the information on the cards is
assumed to be sentences, we will place a blank between lines
(after trimming). Moreover, if a line ends in a period, we
will place an extra blank between it and the succeeding line,
since it is conventional, in typing, to separate sentences
with two blanks. If no paragraphs remain, or if the first line
to be read does not match the pattern passed to PARAGRAPH as
argument, then PARAGRAPH will FAIL.


*  PARAGRAPH(p) will read in a paragraph provided the first
*  card on input matches the pattern p. The paragraph is
*  assumed to continue until a blank appears in column 1. It
*  will fail if a paragraph is not found.
*
          DEFINE('PARAGRAPH(FIRST_LINE)T,P')
          PARA_CONTINUE = POS(0) NOTANY(' ')
                                                  :(PARAGRAPH_END)
*
*  Read in the first line, provided it is the first line of a
*  paragraph. If it is not, fail.
*
PARAGRAPH P = TRIM(READ(FIRST_LINE))              :F(FRETURN)
*
*  Set the variable T equal to 2 blanks or 1 blank depending
*  on whether or not the paragraph accumulated so far (in P)
*  ends with a period.
*
PARAGRAPH_1 T = ' '
          P POS(0) RTAB(1) '.'                    :F(PARAGRAPH_2)
          T = '  '
PARAGRAPH_2
*
*  Now join the next input line provided it is still part of
*  the paragraph. If so, recycle; otherwise return what is
*  in P. Note that the blanks in T are not joined to P unless
*  the READ() is successful.
*
          P = P T TRIM(READ(PARA_CONTINUE))       :S(PARAGRAPH_1)
          PARAGRAPH = P                           :(RETURN)
PARAGRAPH_END


Names referenced    Name      Type      Where defined
by PARAGRAPH:       READ      Function  Program 9.1


Epilogue


PARAGRAPH, like FORTREAD, refers to the READ function to do
its basic input. The pattern which decides whether a line
continues the current paragraph (and hence where the
paragraph ends) is contained in PARA_CONTINUE. This pattern
can be modified for slightly different paragraph conventions 
or can be set as an argument. 


Note that the temporary variable P was used to accumulate the 
material in the paragraph. The variable PARAGRAPH could have 
been used and this would have saved one assignment statement. 
P was used for brevity and convenience and with the knowledge 
that straight assignments of the kind indicated are quite fast 
and their effects on the running time of the overall program 
are negligible. 
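

The joining rule can be sketched in Python for comparison. This is a
hypothetical helper, not the book's function: paragraphs takes a
whole list of lines and returns every paragraph at once, whereas
PARAGRAPH returns one paragraph per call.

```python
def paragraphs(lines):
    """Group lines into paragraphs; a blank in column 1 starts a new one."""
    result, current = [], None
    for line in lines:
        text = line.rstrip()              # trim trailing blanks only
        # An empty line or a blank in column 1 cannot continue a paragraph.
        if current is None or line[:1] in ("", " "):
            if current is not None:
                result.append(current)
            current = text
        else:
            gap = "  " if current.endswith(".") else " "
            current += gap + text         # two blanks after a period
    if current is not None:
        result.append(current)
    return result
```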


Program 9.4  SNOREAD


For many of the same reasons that we would want
statement-gathering activities to be focused in one function
in FORTRAN statement processing, we would want to do the same
if we were processing SNOBOL4. A complexity introduced in
obtaining SNOBOL4 statements is the possibility of multiple
statements per line (separated by semicolons). Moreover, the
fact that quoted literals may have semicolons embedded within
them means that a blind search for a semicolon will not do. A
further complexity is introduced by the fact that labels may
have quotes embedded within them (only semicolons and blanks
may not appear in labels) so that such quotes are to be
ignored when ignoring semicolons within quotes. But we have
encountered such problems in the preceding chapter and, by now,
they should be routine.


Like FORTREAD, SNOREAD will ignore comment cards and fail when
no more statements remain.
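

The two complications (semicolons inside quoted literals, and quotes
inside labels) can be illustrated with a small Python scanner. It is
a sketch under simplifying assumptions, not the pattern used by
SNOREAD: the function split_statements is hypothetical and treats
every character up to the first blank of a statement as label text.

```python
def split_statements(line):
    """Split a SNOBOL4 source line at semicolons outside quoted literals."""
    statements, start, quote = [], 0, None
    in_label = True                       # each statement opens with its label
    for i, ch in enumerate(line):
        if in_label:
            if ch == " ":
                in_label = False          # the label ends at the first blank
            elif ch != ";":
                continue                  # a quote inside a label is ordinary
        if quote:
            if ch == quote:
                quote = None              # closing quote of a literal
        elif ch in "'\"":
            quote = ch                    # opening quote of a literal
        elif ch == ";":
            statements.append(line[start:i])
            start, in_label = i + 1, True
    if start < len(line):
        statements.append(line[start:])
    return statements
```

In the line A'B X = 'P;Q'; Y = "R;S" the quote inside the label A'B
is passed over, so only the semicolon between the two statements
splits the line.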


*  SNOREAD will read in and return the next SNOBOL4 statement.
*  If no statements remain it will fail.
*
          DEFINE('SNOREAD()S,LBL')
*
*  Initialization section: Establish I/O and initialize
*  patterns.
*
          INPUT(.INPUT,5,72)
          ALPHA = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
          NUM = '0123456789'
          CONTINUE.S = POS(0) ANY('+.') REM . S
          SNO_STMTS = POS(0) ANY(ALPHA NUM ' ')
SNO_STMT = (POS(0) BREAK(' 3°) . 

+ : FASTBAL( , "! "8%, tsty 830) | SNOREAD 


                                                  :(SNOREAD_END)
*
*  Examine a buffer (SNO_BUFFER) which presumably has
*  characters in it left over from the last read. If a statement
*  can be pulled out, fine, just return.
*
SNOREAD   SNO_BUFFER SNO_STMT =                   :S(RETURN)
*
*  Otherwise check the buffer for null. If nonnull, then
*  there is a syntactic error in the input.
*
          IDENT(SNO_BUFFER)                       :F(ERROR)
*
*  We now try to fill the buffer. We first make an attempt
*  to read the first card of a sequence of SNOBOL4 statements.
*  If this fails, we assume it's a comment or list
*  control card; in either case we throw the card away and
*  try again until we succeed in getting a statement or hit
*  an end of file.
*
SNOREAD_1 SNO_BUFFER = TRIM(READ(SNO_STMTS))      :S(SNOREAD_2)
          READ()                                  :F(FRETURN)S(SNOREAD_1)
*
*  Scoop up all succeeding continue cards and place a
*  semicolon behind the last card. Then go back to the start
*  of SNOREAD.
*
SNOREAD_2 SNO_BUFFER = SNO_BUFFER ' ' ?READ(CONTINUE.S)
+            TRIM(S)                              :S(SNOREAD_2)
          SNO_BUFFER = SNO_BUFFER ';'             :(SNOREAD)
SNOREAD_END


Names referenced    Name       Type      Where defined
by SNOREAD:         READ       Function  Program 9.1
                    FASTBAL *  Function  Program 8.4


* indicates name is referenced in the initialization section.


Program 9.5  TREEREAD


A tree, in the context we will be using it, will be a
collection of data in a hierarchical organization. An example
of a tree is shown in Figure 9.1.


Figure 9.1


An example of a tree. 


There is a root node at the top (just the reverse of 
biological trees which have their roots at the bottom). The 
root node has 0 or more immediate descendants or sons. Each 
of these, in turn, has 0 or more immediate descendants.
Moreover, each node has a value associated with it which, for 
the sake of current discussion, we will assume is a string. 


In the example shown in Figure 9.1, the root node has the 
value 'A' and its 3 sons have the values 'B', 'C' and 'F'
respectively. 


Reading a tree implies both an external form by which the 
programmer specifies his tree, and an internal form by which 
the tree will be represented in the machine. These represent 
two decisions which will have to be made before we can
progress further. 


In general, the representation of computer data is an issue
which is perpetually confronted by the computer programmer.
His choice can significantly influence the runtime and storage
efficiency of the resulting program, as well as the ease with
which he can write, debug, modify, and extend his program. In
a string language such as SNOBOL4 there is a built-in
prejudice to represent data objects as strings, because of the
language's rich string handling capability. That is, one
feels that when it comes time to process the data object, in a
way or ways not clearly foreseen at the start of the program,
the necessary tools will probably be there.


Another strong advantage of using strings to represent data in 
SNOBOL4 is the relative ease with which one can monitor the 
changing forms of the data. There are several semiautomatic 
tracing features available to the SNOBOL4 user (&FTRACE and
&TRACE) which print out the values of variables if they are
strings, integers or reals but not otherwise. Under such cir- 
cumstances the advantage of using strings to represent data is 
more than obvious.* But even if these tracing features were 
not especially inclined to favor the string, there is nonethe- 
less a convenience in being able to display an entire data 
object in one fell swoop merely by printing a string. 


Another advantage of using a string to represent the data is 
that (in SNOBOL4 at least) the data within the string will
occupy contiguous storage locations. This can mean that certain
kinds of analysis can be made very rapidly by a scan. Many 
machines have built-in mechanisms for quickly scanning con- 
tiguous core storage for particular data items. Such efficient 
machinery can be brought to bear upon a data structure in con- 
tiguous core whereas it could not if the data were associated 
by means, for example, of address links.


One reason for not representing a tree as a string is that the 
values of the nodes may not be conveniently representable as 
strings. Another reason may be that the operations that an
application will typically make upon a tree may be rather un- 
natural for a string. We will show in a later chapter how a 
tree may be represented in SNOBOL4 as a linked structure. For 
this chapter, we will consider only string representations. 


There are many ways in which trees may be represented as 
strings internally. To visualize one very exotic way, imagine 
that a tree is elaborately displayed in a printout page with 
lines of, say, asterisks connecting up boxes denoting the
nodes, etc. Then the sequence of lines of this printable image 


* This limitation need not be viewed as a strict one. The
discussion surrounding the function FTRACE, Prog. 14.3, 
describes how the values of data aggregates may be 
automatically dumped as well. 
will, when concatenated, denote unambiguously a tree. Such an
example is a very good one of how not to encode a tree. Not 
only is the encoding inefficient in terms of storage but it 
also would prove to be unwieldy in processing (selecting, 
searching, deleting, adding, etc.). 


One sane way of representing a tree is by a LISP-like 
representation [McCarthy, 1960]. A node is encoded 


          (v,s1,s2,...,sn)


where v is the value of the node, and where each s is the 
representation of a son. For example, the tree in Figure 9.1 
is represented as 


          (A,B,(C,D,E),(F,G))


Using such a representation, the values of nodes are restricted
in that they may not contain commas or either of the paren- 
theses (or if they do, three other characters would have to be 
found at the loss of some notational naturalness). Another 
disadvantage is that, in many applications, it is convenient 
to be able to obtain, without an involved computation, the 
number of sons of a given father node. For both these reasons, 
we will use a slightly different method which is a variant of 
polish prefix notation (from Lukasiewicz [195, p. 78] but see
Higman [1967, p. 24] for a nice general discussion). We will
represent a node as 


          v,n,s1,s2,...,sn


where, as before, v is the value of a node, n is the number of
sons and s represents a son. The tree in Figure 9.1 would be
represented as


          A,3,B,,C,2,D,,E,,F,1,G,,

Here a node without sons is represented as

          v,,


That is, the null string as well as an explicit 0 can be used 
to denote 0 sons. This blends well with the SNOBOL convention 
of regarding null strings as arithmetically equal to 0. 


The parenthesis-free or polish notation is somewhat more dif- 
ficult to analyze visually than the parenthesis notation but 
it is significantly easier to manipulate and for that reason 
is a good machine representation. 
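

The encoding can be checked mechanically. The Python sketch below is
an illustration only: it assumes a tree held as nested
(value, sons) pairs, and the names to_polish and from_polish are
hypothetical. It produces the string shown above for the tree of
Figure 9.1 and recovers the tree from that string.

```python
def to_polish(node, brk=","):
    """Encode (value, [sons...]) as value, son count, then the sons."""
    value, sons = node
    count = str(len(sons)) if sons else ""    # null string stands for 0 sons
    return value + brk + count + brk + "".join(to_polish(s, brk) for s in sons)


def from_polish(text, brk=","):
    """Rebuild the nested (value, [sons...]) form from the encoded string."""
    fields = text.split(brk)

    def parse(i):
        value, count = fields[i], int(fields[i + 1] or 0)
        i += 2
        sons = []
        for _ in range(count):
            son, i = parse(i)
            sons.append(son)
        return (value, sons), i

    return parse(0)[0]
```

Because each node carries its own son count, the decoder never has to
search for a matching parenthesis; it simply reads fields in order.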


The external representation of the tree would be that form as 
it is keypunched onto cards or typed onto a teletypewriter. 
To be more explicit, we are concerned with an external input 
representation as opposed to an external output representa- 
tion. There are obvious fundamental distinctions between a 
tree representation which one is willing to type and a tree 
which one would like to see. For the former, we require ease
of typing and ease of modifying which are not considerations 
of the latter. 


The form of external input representation we will use is 
similar to the form used by COBOL and PL/I to represent struc- 
tures. The root node is said to be on level 1. Its immediate 
descendants are on level 2; the immediate descendants of any 
node are one level number greater than the level number of 
that node. Thus the representation of any node of a tree is 
given as


          k v
          s1
          s2
          ...
          sn


where k is the level number of the node, v is the value of the 
node and each s represents a son (in the same format). For 
example, the representation of the tree shown in Figure 9.1 is 


          1 A
          2 B
          2 C
          3 D
          3 E
          2 F
          3 G


This form of the tree is not difficult to type or to modify. 
It is also not very difficult to read, particularly if the in- 
put processor permits indentation (as ours will) so that the 
tree may be typed:


          1 A
            2 B
            2 C
              3 D
              3 E
            2 F
              3 G


The actual program to convert trees from the external input 
form into modified polish is given below. 


*  TREEREAD(level) will read a tree beginning at the given
*  level. It will fail if this level is not found on the
*  input.
*
          DEFINE('TREEREAD(LEVEL)SONS,N')
*
*  TR_BC is the tree break character used to separate items
*  in the strung-out version of the tree.
*
          TR_BC = ','
*
*  The pattern LEVEL.TREEREAD tests the level and extracts
*  the value placing this value into TREEREAD.
*
          LEVEL.TREEREAD = POS(0) (SPAN(' ') | NULL) *LEVEL
+            SPAN(' ') REM . TREEREAD
                                                  :(TREEREAD_END)
*
*  Read in the node at the current LEVEL and assign the value
*  of this node to TREEREAD and tack on the break character.
*  If the LEVEL argument does not match the input level then
*  fail.
*
TREEREAD  READ(LEVEL.TREEREAD)                    :F(FRETURN)
          TREEREAD = TRIM(TREEREAD) TR_BC
*
*  Read in the sons of this node by calling TREEREAD
*  recursively at a level one higher than the current level. The
*  number of sons is counted in N.
*
TREEREAD_1 SONS = SONS TREEREAD(LEVEL + 1)
+                                                 :F(TREEREAD_2)
          N = N + 1                               :(TREEREAD_1)
*
*  Concatenate the value of the father, the number of sons
*  and the representation of the sons.
*
TREEREAD_2 TREEREAD = TREEREAD N TR_BC SONS       :(RETURN)
TREEREAD_END


Names referenced    Name      Type      Where defined
by TREEREAD:        READ      Function  Program 9.1


Epilogue


The first executed statement on entry to TREEREAD calls the 
by-now familiar READ, requesting that a card be read only if 
it is of the level requested. TREEREAD will then call itself 
recursively to obtain trees at levels one deeper. When recur- 
sion is called for, the savings in program length can be 
dramatic and the subjective effects exhilarating. There are 
types of environments in which recursion seems quite well 
suited. One of these environments is when the data structure 
is organized recursively such as the trees in this example. 


The break character is set in the initialization section to be 
a comma. This can be changed at any time by assigning a new break
character to the variable TR_BC. 
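

The recursion can be followed in a Python sketch of the same idea. It
is an illustration with assumed names, not a translation: tree_read
works on a list of card images that it shortens as it consumes them,
and it returns None where TREEREAD would fail.

```python
def tree_read(cards, level=1, brk=","):
    """Consume cards for one tree at the given level; return its encoding.

    Returns None, leaving cards untouched, if the next card is not at
    the requested level.
    """
    if not cards:
        return None
    fields = cards[0].split(None, 1)      # level number, then the value
    if len(fields) != 2 or fields[0] != str(level):
        return None
    cards.pop(0)
    value = fields[1].rstrip()
    sons, count = "", 0
    while True:
        son = tree_read(cards, level + 1, brk)    # sons sit one level deeper
        if son is None:
            break
        sons += son
        count += 1
    # A node without sons gets a null count, as in the text.
    return value + brk + (str(count) if count else "") + brk + sons
```

A card at the wrong level is left in place for the caller one level
up, which is the service that READ provides in the program above.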
Program 9.6  MFREAD


The READ function (Program 9.1) is flexible to the extent that
input can be obtained, not merely from the standard card
reader, but from any file associated with the
variable INPUT. That is, we could reassociate the variable
INPUT in order to obtain the input from a source other than
the standard input. An example of a reassociation of INPUT
was given in the FORTREAD and SNOREAD functions (Programs 9.2
and 9.4); there, INPUT was reassociated not with a nonstandard
file (although it could have been) but with a file whose
record length was nonstandard (i.e., 72 rather than 80).


It may be, however, that it is desired to read from two or 
more files simultaneously and then, the original READ would 
not do. Even if the user would be willing to reassociate the 
variable INPUT on each shift of the input stream, the scheme 
would not work because the saved string in INPUT_BUF would 
become hopelessly mixed between the various streams. 


But it is possible to generalize READ to handle multiple 
streams. Our extended version will allow a second argument to 
indicate the source. Thus 


READ(P, .SYSUT1) 


will read from the source associated with the variable SYSUT1.
Also, a null second argument will imply the stream associated 
with INPUT. Thus, READ(P) will be equivalent to 


READ(P, .INPUT)


In this way our new READ will be upward-compatible with the
old READ. 


The new READ, while more general, is less efficient than the 
old READ, and so there are advantages to both. In practice,
one can do with the efficient READ until such time as it 
becomes necessary to read more than one stream; then one can 
simply 'plug-in' the more general READ. 
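

The essential change, one saved card per stream rather than a single
shared buffer, can be sketched in Python. The class MultiReader and
its attach method are hypothetical names chosen for the illustration,
and regular expressions again stand in for SNOBOL4 patterns.

```python
import re


class MultiReader:
    """Pattern-gated reads with a separate lookahead card per stream."""

    def __init__(self):
        self._streams = {}                # unit -> [card iterator, saved card]

    def attach(self, unit, cards):
        """Associate a unit name with a source of card images."""
        self._streams[unit] = [iter(cards), None]

    def read(self, pattern="", unit="INPUT"):
        """Return the next card on unit if pattern matches it, else None."""
        state = self._streams[unit]
        if state[1] is None:
            state[1] = next(state[0], None)
            if state[1] is None:          # this stream is exhausted
                return None
        if re.search(pattern, state[1]) is None:
            return None                   # the card stays saved for this unit
        card, state[1] = state[1], None
        return card
```

A card rejected on one unit remains saved for that unit alone, so
reads on the other units are unaffected.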


*  MFREAD(P,U,L) will behave like READ(P) except that an
*  optional second argument (U) can be used to specify a unit
*  other than the normal reader. An optional 3rd argument
*  can specify a logical record length other than 80 (for the
*  first call associated with a given unit).
*
          DEFINE('MFREAD(P,U,L)BUF,NF,NM,DATA')
*
*  Establish structure to hold data on each file.
*
          DATA('RDATA(RNM,RBUF,RNF)')
*
*  Establish table to hold structures. Establish default
*  file.
*
          READ_TBL = TABLE()
          READ_TBL<> = RDATA(.INPUT)
*
*  Seize control on calls to the REWIND function. Do a real
*  rewind but also discard any file information for unit N.
*
          OPSYN('REWIND.','REWIND')
          DEFINE('REWIND(N)')                     :(MFREAD_END)
REWIND    READ_TBL<N> =
          REWIND.(N)                              :(RETURN)
*
*  Entry point: Obtain DATA associated with unit U. If DATA
*  is null establish an entry for this unit and
*  input-associate some contrived name.
*
MFREAD    DATA = READ_TBL<U>
          IDENT(DATA,NULL)                        :F(MFREAD_1)
          NM = 'READ:' U
          DATA = RDATA(NM)
          READ_TBL<U> = DATA
          INPUT(NM,U,L)
*
*  Arrival here means that DATA contains the data associated
*  with our I/O unit. Extract information. If NF is less
*  than 0 fail immediately.
*
MFREAD_1  NM = RNM(DATA)
          BUF = RBUF(DATA)
          NF = RNF(DATA)
          LT(NF,0)                                :S(FRETURN)
*
*  If BUF is null, fill it. Then test it against P. If fail,
*  FRETURN. Otherwise return BUF.
*
          IDENT(BUF,NULL)                         :F(MFREAD_2)
          BUF = $NM                               :F(MFREAD_3)
          RBUF(DATA) = BUF
MFREAD_2  BUF P                                   :F(FRETURN)
          MFREAD = BUF
          RBUF(DATA) =                            :(RETURN)
*
*  Decrement NF and try again.
*
MFREAD_3  RNF(DATA) = NF - 1                      :(MFREAD_1)
MFREAD_END


Epilogue


The extended version of READ is patterned after the single- 
file READ. There are several additional statements in the 
initializing section which set up the names of variables which 
are to be indirectly referenced. Beyond the label READ_3, 
things are pretty much the same as the simpler READ with in- 
direct referencing replacing the direct referencing. That is, 
instead of referring for example to the variable INPUT_BUF a
reference to the variable $B is made where B has been assigned 
an appropriate name. 


The first statement executed (after the entry point) assigns 
the name 'INPUT' to the variable F provided F is null. This 
is a common way of assigning default values to dummy
parameters in functions.


The reader may be somewhat alarmed as to the amount of over- 
head associated with each read request. This overhead, 
however, may be quite tolerable in a programming situation 
which involves relatively few reads compared with other com- 
putations or in a situation in which programming the problem 
costs more than running it. If the overhead proves excessive, 
the reader will find an outline for a faster Multifile READ in
Exercise 4.6. 


OUTPUT ROUTINES


As was mentioned in the introductory remarks of this chapter,
output in SNOBOL4 is almost magically simple. Assigning a
string to the variable OUTPUT or PUNCH will print or
punch the string respectively. Moreover, it does
not have the problems that input has; i.e. transmission
is not typically tentative depending on the value of
the string and output files are not sequenced like input files
may be. But there are problems nonetheless. For one thing,
printed output must appeal to the human eye which means
vertical as well as horizontal alignment and this generally is
difficult to do when simply outputting strings. For the same
reason, overstriking, which calls for a perpendicular
alignment, is equally awkward and unnatural. Both of these obstacles
are overcome quite easily with the use of the block datatype,
a discussion of which is deferred until a later chapter.


For this chapter we will consider only basic card output; 
i.e., output which is meant to be read by some other computer
program. 


Program 9.7  PUT


Just as it is good practice to focus input activities into a
single function, so it is a good idea to do the same for
output. PUT is a function which will accept as argument
a string (of no greater than 72 characters) and print this
card labeled and numbered in the identification field (columns
73 through 80). It will also punch what is printed.


Labelling is effected by the user of PUT by assigning a string 
to the variable PUT_LABEL. Thus 


PUT_LABEL = 'PUT' 


will set this label to the indicated three letters.


Numbering of cards is by increments of 1. Sometimes it is
desired to increment by a number other than 1, which is
accomplished by setting the value of PUT_INC. Thus


PUT_INC = 10 


will set the increment to 10. 


*  PUT(L) will output L (presumed to be a card image). It
*  will label the OUTPUTted card starting in column 73. The
*  user may specify the label by assigning a string to the
*  variable PUT_LABEL. The cards will be numbered in
*  increments of 1; the increment can be changed by assigning
*  an appropriate value to PUT_INC.
         DEFINE('PUT(L)')
         PUT_INC = 1                                 :(PUT_END)
PUT      PUT_N = PUT_N + PUT_INC
         OUTPUT = RPAD(L,72) PUT_LABEL
+                 LPAD(PUT_N, 8 - SIZE(PUT_LABEL))
         PUNCH = OUTPUT                              :(RETURN)
PUT_END

Names referenced    Name    Type        Where defined
by PUT:             LPAD    Function    Program 3.2
                    RPAD    Function    Program 3.3


Epilogue


Note that when OUTPUT is used on the right hand side of the 
assignment (last executable statement) the value last output 
is used as value and no OUTPUTing of information is implied or 
inferred. 
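
The card layout produced by PUT can be illustrated outside
SNOBOL4. The following is a minimal sketch in Python, not the
author's routine: make_put, put and card are names invented
for the illustration, and ljust and rjust stand in for RPAD
and LPAD. It only builds the 80-column image; it does not
print or punch anything.

```python
def make_put(label="", inc=1):
    """Return a put(line) function that builds 80-column card images."""
    count = 0

    def put(line):
        nonlocal count
        count += inc
        # Columns 1-72: the text, padded on the right (RPAD in the book).
        body = line.ljust(72)
        # Columns 73-80: the label, then the card number right-justified
        # in whatever remains of the 8-column field (LPAD in the book).
        ident = label + str(count).rjust(8 - len(label))
        return body + ident

    return put


put = make_put(label="PUT", inc=10)
card = put("      X = Y + Z")
print(card[72:])   # the identification field of the first card
```

As in the original, the line is assumed to be no longer than
72 characters; a longer line would simply push the
identification field to the right.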


For debugging purposes, it is perhaps prudent to turn punching
off. This can be done either by removing the assignment to
PUNCH or by executing the statement:

         DETACH(.PUNCH)


The latter is preferred since when it comes time to actually 
punch, it will be obvious what to do. 


Program 9.8  FORTPUT


In the description of FORTREAD (Program 9.2) several examples
of FORTRAN source processing were given. In three of these
examples (preprocessing, conversion and housekeeping) the
output is also FORTRAN and, in such cases, the programming
situation can be simplified by writing an output function
specially designed for FORTRAN statements.


*  FORTPUT(S) will output a FORTRAN statement S. The card
*  will also be punched, labeled, numbered, and continued if
*  necessary.
         DEFINE('FORTPUT(S)T')                       :(FORTPUT_END)
*  Entry point: Remove initial chunk from S; output it; check
*  for completion, if so return.
FORTPUT  S (LEN(72) | REM) . T =
         PUT(T)
         IDENT(S,NULL)                               :S(RETURN)
*  Since something is left in S we must supply a continuation
*  card. The location field of this continuation card (the
*  first 5 characters) must be blank.
         S = DUPL(' ',5) '1' S                       :(FORTPUT)
FORTPUT_END

Names referenced    Name    Type        Where defined
by FORTPUT:         PUT     Function    Program 9.7
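
The continuation logic of FORTPUT can be sketched in Python as
follows. This is an illustration only, under the assumption
that returning the list of card bodies is an acceptable
stand-in for calling PUT on each; fortput_cards, stmt and
cards are names invented here.

```python
def fortput_cards(stmt):
    """Split a FORTRAN statement into card bodies of at most 72 columns."""
    cards = []
    while True:
        # Remove the initial chunk: 72 characters, or whatever remains.
        chunk, stmt = stmt[:72], stmt[72:]
        cards.append(chunk)
        if stmt == "":
            return cards
        # Something is left: supply a blank location field (columns 1-5)
        # and a continuation mark in column 6, then go around again.
        stmt = " " * 5 + "1" + stmt


stmt = "      X = " + "A+" * 40 + "B"
cards = fortput_cards(stmt)
for c in cards:
    print(c)
```

Stripping the six-column prefix from each continuation card
and concatenating the pieces gives back the original
statement.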


Program 9.9  PEEL


SNOBOL4 statement outputting (which we do next in Program
9.10) is more complex than FORTRAN outputting because a
SNOBOL4 statement cannot be split arbitrarily but only at a
point where a blank may appear (but not within quoted
literals). The determination of a suitable break point in a
SNOBOL4 statement will be done by the function PEEL. This
function is being isolated because it can be used for other
purposes such as compressing and reformatting SNOBOL4
statements. Also, a slightly modified version of PEEL can be
used for finding break points in JCL (Exercise 9.8).


PEEL(name, n) will peel off and return a prefix from the named 
string. The prefix is to be as large as possible but not 
longer than n characters. The named string will be modified. 
The prefix will be broken off from the named string only at a 
suitable break point defined as follows. The break may never 
appear within quotes. Given this first condition, it may occur 
before any of the characters in BEFORE or after any of the 
characters in AFTER. If no prefix can be found other than the 
null string then PEEL will fail. 


PEEL has a side effect. In addition to returning a value, it
will modify a part of the outside world. In particular, it
will remove a prefix from the string named by the first
argument. The modification of supplied arguments can only be
accomplished in SNOBOL4 by passing as argument the name of the
variable. Thus to remove a prefix from the string S the call
to PEEL must be of the form


         PEEL(.S,n)


(the call PEEL('S',n) although equivalent is not recommended 
because it does not provide as good documentation and in some 
implementations is less efficient). This method of denoting 
arguments is a bit unusual inasmuch as the arithmetic 
languages, FORTRAN, PL/I and ALGOL permit functions to modify 
argument variables without the encumbrance of an initial 
period. At first, the initial period appears to be something 
of a nuisance. As it turns out, however, it has the important 
advantage of alerting the reader to the possibility of side 
effects. 


*  PEEL(NAME,N) will peel off and return a prefix from the
*  named string. The prefix is to be as large as possible
*  but not longer than N characters. The named string will
*  be modified. The prefix will be broken off from the named
*  string only at a suitable break point. The break may never
*  appear within quotes. It may occur before any of the
*  characters in BEFORE or after any of the characters in
*  AFTER. If no prefix can be found other than the null
*  string then PEEL will fail.
         DEFINE('PEEL(NAME.,N.)K1.,K2.')
         BEFORE = ') ,>'
         AFTER = '( ,<'
         PEEL.K2. = POS(0) TAB(*K1.) (ANY(AFTER) @K2. |
+           BAL(,'"' "'") (@K2. ANY(BEFORE) | ANY(AFTER) @K2. |
+           RPOS(0) @K2.))                           :(PEEL_END)
*  If the NAME.ed string is no longer than N. characters,
*  return the value and null out the variable.
PEEL     LE(SIZE($NAME.),N.)                         :F(PEEL_1)
         PEEL = $NAME.
         $NAME. =                                    :(RETURN)
*  Otherwise we scan for a break point in the named string.
*  Our search begins after the K1.th character (K1. is
*  initially 0) and assigns the numerical value of the break
*  point to K2. Ultimately K2. exceeds the value of N. at
*  which point we transfer to PEEL_2.
PEEL_1   $NAME. PEEL.K2.                             :F(ERROR)
         GT(K2.,N.)                                  :S(PEEL_2)
         K1. = K2.                                   :(PEEL_1)
*  The breakpoint is now indicated by K1. and provided it is
*  not zero we can return normally.
PEEL_2   EQ(K1.,0)                                   :S(FRETURN)
         $NAME. LEN(K1.) . PEEL =                    :(RETURN)
PEEL_END

Names referenced    Name     Type        Where defined
by PEEL:            BAL *    Function    Program 8.3


* indicates name is referenced in the initialization section.


Epilogue 
PEEL is not as fast as it could be. The pattern PEEL.K2.
advances by 1 character at a time until overflow occurs. The
inefficiency is normally not troublesome because PEEL will
normally be able to return the entire string without having to
search for a break point. Nevertheless, some applications
might call for a faster PEEL and Exercise 9.9 outlines a
method for increasing the speed as well as increasing the
selectivity as to where breaks may occur.


The names of parameters and temporary variables (viz. NAME.,
N., K1. and K2.) were deliberately made strange so as to
reduce the chances of duplicating the name passed as first
argument to PEEL. This issue is discussed fully in the
Epilogue of the SWAP routine (Program 3.14).
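
The break-point rule can be tried out with a simplified sketch
in Python. It is not a transcription of PEEL: it tracks quoted
literals only (the parenthesis balancing done by BAL is
omitted), it scans the whole string instead of advancing one
break at a time, and since Python cannot modify the caller's
string it returns the prefix and the remainder as a pair. The
names break_points, peel and QUOTES are inventions of this
sketch, and a ValueError stands in for the failure of PEEL.

```python
BEFORE = ") ,>"
AFTER = "( ,<"
QUOTES = "'\""


def break_points(s):
    """List every k such that s may be split between s[:k] and s[k:]."""
    points = []
    quote = None                       # the quote character now open
    for k, ch in enumerate(s):
        # Boundary k lies just before s[k]; usable only outside quotes.
        if quote is None and k > 0 and (ch in BEFORE or s[k - 1] in AFTER):
            points.append(k)
        # Update the quote state after the boundary has been examined.
        if quote is None:
            if ch in QUOTES:
                quote = ch
        elif ch == quote:
            quote = None
    points.append(len(s))              # the end of the string counts
    return points


def peel(s, n):
    """Return (prefix, rest), prefix being the longest legal one <= n."""
    if len(s) <= n:
        return s, ""
    best = max((k for k in break_points(s) if k <= n), default=0)
    if best == 0:
        raise ValueError("no nonnull prefix ends at a break point")
    return s[:best], s[best:]


prefix, rest = peel("X = F(A,B) 'HELLO WORLD' Y", 20)
print(repr(prefix), repr(rest))
```

In the example the blank inside the literal falls within the
20-character limit but is rejected, so the split is made just
before the opening quote.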


Program 9.10  SNOPUT


The function to output SNOBOL4 statements is shown in Program
9.10. PEEL has greatly simplified its writing.


*  SNOPUT(S) will output a SNOBOL4 statement S. It will
*  handle automatically: labeling, numbering, punching, and,
*  if necessary, continuation.
         DEFINE('SNOPUT(S)')                         :(SNOPUT_END)
*  Output the first 72 characters (breaking appropriately).
SNOPUT   PUT(PEEL(.S,72))                            :F(ERROR)
*  If S is null we are done, otherwise peel off the next 71
*  characters and prefix with a continuation (+). Continue
*  to do this until S is null.
SNOPUT_1 IDENT(S,NULL)                               :S(RETURN)
         PUT('+' PEEL(.S,71))                        :F(ERROR)S(SNOPUT_1)
SNOPUT_END

Names referenced    Name    Type        Where defined
by SNOPUT:          PUT     Function    Program 9.7
                    PEEL    Function    Program 9.9


Exercise 9.1  Extend the basic READ routine so that it can
operate like a pushdown stack. Thus


         PUSH('ABC')
         PUSH('XYZ')
         A = READ()
         B = READ('S')
         C = READ('YZ')
         D = READ()


when executed will cause the following values to be assigned.


         A = 'ABC'
         C = 'XYZ'
         D = the next input card.

The PUSH & POP routines (Progs. 5.5 & 5.6) may be used. In
fact, the PUSH above is assumed to be exactly Prog. 5.5.


Exercise 9.2  Modify PARAGRAPH so that the start of the next
paragraph is denoted by a pattern given to PARAGRAPH as
argument. You may use the modified READ given in Exercise 9.1.


Exercise 9.3  Modify FORTREAD so that it returns the FORTRAN
statement with all extraneous blanks removed (i.e., blanks not
in positions 1 through 6, not within quotes, and not within a
Hollerith field (nH...)).


Exercise 9.4  Modify TREEREAD to accept trees whose structure
is denoted by


(a) indentation (allow sons to have any indentation greater
    than their fathers)


(b) numerical values without the restriction that level
    numbers increase in steps of 1.


In each case assume that the value of a node is some nonnull
quantity.


Exercise 9.5  Use READ to write a function called ASMREAD
which is to read in statements from IBM's OS/360 assembly
language [IBM360b]. The fact that a given card is to be
continued is denoted by a nonblank in column 72 but this
character is not considered part of the statement. The next
following card (incredibly) must have blanks in columns 1
through 15 and these blanks (but no following blanks) are
ignored when building the statement. ASMREAD should fail if
an inconsistency is encountered in one of the continue
conventions.


Exercise 9.6  Write a multifile READ which avoids most of the
inefficiencies of multifile reading in the following way: When
READ is called, control is directed to the label 'READ_' F
where F is the file name. The statements transferred to can be
compiled at runtime (using the CODE function) at the first use
of file F and can be 'custom-made' for the particular file
name.


Exercise 9.7  Given the tab mechanisms of keypunches and
teletypewriters, it is easier, in typing, to left-justify
elements within fields whereas many applications (especially
numerical) call for right justification of elements within
fields.


(a) Given an 80-character string (card image) in the variable
    S, write a single statement to right-justify any left-
    justified element in the field which starts in the column
    numbered C and whose length is L. You may use LPAD and/or
    RPAD (Progs. 3.2 & 3.3).


(b) Use (a) as the basis for a program which will right-
    justify elements in a deck of cards. The first input card
    contains a sequence of X's in each field to denote their
    locations. This can be converted to a sequence of number
    pairs and then (a) can be repeated for each number pair
    and each card.

Exercise 9.8  (a) Using READ, write a function (called
JCLREAD) which will extract a complete JCL statement
[IBM360c] from the input stream (let it pass over and output
all non-JCL). Delete unnecessary blanks between a control card
and the following continue. Remove all comments.


(b) Write a function to output JCL. (Hint: PEEL can be
    used.)


(c) Test the two functions by replacing in a set of JCL
    statements every occurrence of 'DSNAME=' by
    'DSNAME=LIBRARY.'.


Exercise 9.9  To improve the operating speed of PEEL (Prog.
9.9) one may search over nonbreaks and/or decrease the number
of break points.


(a) Write a pattern which behaves like PEEL.K2. but which
    uses FASTBAL, Prog. 8.4, to rapidly scan over characters
    which are not significant in determining break points
    (viz. BEFORE, AFTER and the quotes).


(b) If we reduce the break set (say AFTER = '=' and BEFORE =
    's') then we will have higher speed and the break points
    will be more aesthetically placed. There is the danger,
    however, that a nonnull peel cannot be made. Rewrite PEEL
    so that if it runs into difficulties with the given
    BEFORE and AFTER, it temporarily uses a stronger version
    of PEEL.K2. (richer BEFORE and AFTER) to crack the given
    statement.


Exercise 9.10  (a) Let the variable NAME. have the value


         'LABEL SUBJECT PATTERN = OBJECT :(LABEL)'

What value is returned by the call

         PEEL('NAME.', 35)


(b) Modify PEEL so that if the name given is a forbidden
    name, PEEL will go to ERROR.


Exercise 9.11  Using SNOREAD and SNOPUT write a SNOBOL4
program to process other SNOBOL4 programs such that every call
to the function ALPHA is replaced by a call to the function
ALPHANUMERIC.


Exercise 9.12  Using SNOREAD and SNOPUT write a program to
squeeze out extraneous blanks from another SNOBOL4 program. Be
sure to pack as many statements on a line as possible.



The paragraph you are reading now has been formatted by a
computer directed by the very programs we will describe in
this chapter. Paragraph formatting is a special case of the
more general activity known as text formatting. Whereas the
former activity is limited to the shaping of individual
paragraphs, the latter activity is more open-ended and
includes page layout, pagination, etc.


What, the reader may ask, is so complicated about decomposing
a paragraph into lines that we must spend an entire chapter in
its discussion? If all that were involved in this process were
the cutting of lines at convenient blanks and padding with
blanks to right-justify margins, then we could dispose of the
subject in about a page of text and 6 lines of code. But the
task is complicated considerably by the seemingly minor
details of backspacing, underscoring and hyphenation. Though
the need for overstriking is relatively rare, it does exist
and just as much code need be written if we are backspacing
occasionally as frequently. In fact, it is quite normal that
90% of the execution time of a program is spent in only 10% of
it. This fact and its implications for optimum programming are
not always fully appreciated. All too often, programmers care
only to get the program performing as expected without regard
to efficiency considerations or, at the other extreme, have a
compulsive urge to optimize every bit of it. Both miss the
sound central approach of implementing efficiently that
portion which is used most frequently. In this chapter we will
have ample occasion to employ this principle.


In Program 9.3 we showed how to read in a paragraph and in 
this section we will format it. Between these two activities, 
the paragraph may undergo conversions in what we will refer to 
as the pre-processing stage. If the original input device were 
a keypunch, then almost certainly some kind of upper to lower 
case conversion would be necessary. More generally, if
characters appear on the printer which are not available on
the input device, a conversion is necessary to produce those
characters. Another instance in which conversion is used is
in the indication of variable information such as figure
numbers and exercise numbers. In a sophisticated text processor,
these will be given in symbolic form to be converted to actual 
numbers when the text is printed. 


We will assume that, possibly as a result of this
pre-processing, the input text may contain the special
characters BSPACE and USCORE. BSPACE, as its name implies,
will permit the user to overstrike print characters. We will
denote this character by backarrow (←) so that 'O←/' will
print as 'Ø'. Just what character the user types to obtain a
BSPACE in his text is determined by the pre-processor. In the
system used to prepare this document, the symbol '-' was used.
Backspacing complicates such issues as separating a paragraph
into lines and printing a line on a device which does not
directly support the backspace character (such as a printer).
It also serves to cloud the issue of when a line equals
another line.


Overstriking can extend the set of characters which one can 
print. Several examples of interesting overstruck combinations 
are shown in Table 10.1. 


Table 10.1  Characters obtainable via overstriking

         (cent sign)
         (dagger)
         (double dagger)
         (not equal)
         (division)
         (symbolic blank)
         (right arrow)
         (left arrow)
         (Theta)
         (Phi)
         (Gamma)
         (Lambda)


USCORE is a character which appears in pairs and indicates
that any material between them is to be underscored. In a
sense, underscoring is a special case of backspacing but, in
another sense, it is not. For example, we are permitted to
break lines at blanks and expand lines at blanks for the
purpose of formatting paragraphs. But we would also like to be
able to break the line:


"A quick brown fox really did jump over..." after the "really"
so that we might print:


A quick brown fox really 
did jump over... 


Note that not only are we breaking at a nonblank, we are
actually discarding a character. If the underscore character
('_') were treated as a break character, then there may be
difficulties with formatting paragraphs which contain '_'. One
example of this is the paragraph you are reading now. Another 
example is 


"Printing the string 'A Be-+__' yields 'A Bt." 
In the above case it becomes not merely awkward but actually
impossible to disentangle that which is regarded as
underscoring from that which is overstriking.


The USCORE character is inserted into the text by the
pre-processor and is not actually typed by the user. The way
in which the user will indicate underscoring will depend on
the input device. In the system which formatted this text (and
which is oriented toward keypunch input) the underscore
character ('_') is used to denote that the following word is
to be underscored and a sequence of the form __ ... _
indicates underscoring of an arbitrary string of characters.
In a system oriented toward teletype input the sequence


         n-characters n-backspaces n-underscores

could be translated by the pre-processor into


         USCORE n-characters USCORE


Program 10.1  BNORM


Backspace normalization is the process of converting a string
with backspaces embedded in it into a string which prints
identically to the first but in which no two backspaces occur
consecutively. Thus 'ABCD←←←←1234' is translated into
'A←1B←2C←3D←4'. This serves to localize the effect of
backspacing, simplifying later processing. It also serves as a
necessary prelude to image normalization as described in
INORM, Program 10.2.


To describe rigorously what is meant by B-normalization, we
define the spacing of a string as equal to the number of
characters in the string minus twice the number of BSPACE's
and minus the number of USCORE's. Thus, the string 'AB←C' has
a spacing of 4 - 2(1) = 2. The string 'AωB←Cω' (where ω is the
USCORE) has a spacing of 6 - 2(1) - 2 = 2. Informally the
spacing of a string equals the net movement of the type ball
(or equivalent mechanism) when the string is printed on a
teletypewriter. Note that the spacing can be negative as in
the string '←←A'.
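
The definition of spacing translates directly into a one-line
function. The sketch below is in Python rather than SNOBOL4,
and the two control characters chosen to stand for BSPACE and
USCORE are arbitrary assumptions of the illustration.

```python
BSPACE = "\b"      # stand-in for the backarrow character
USCORE = "\x1f"    # arbitrary stand-in for the USCORE character


def spacing(s):
    """Characters, minus twice the BSPACE's, minus the USCORE's."""
    return len(s) - 2 * s.count(BSPACE) - s.count(USCORE)


print(spacing("AB" + BSPACE + "C"))
print(spacing("A" + USCORE + "B" + BSPACE + "C" + USCORE))
print(spacing(BSPACE + BSPACE + "A"))
```

The three calls evaluate the three examples of the preceding
paragraph.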


We define a prefix of a string as any initial sequence of
characters of the string. Thus, 'PR' is a prefix of the string
'PREFIX'. In general, a string of n characters will have n+1
prefixes including the null string and the string itself.
Similarly, a suffix is any terminal sequence of characters.
More formally, P is a prefix of S if there exists a string T
such that


         P T = S

and F is a suffix of S if there exists a string T such that

         T F = S
A string is said to be balanced on the left if the spacing of
each of its prefixes is nonnegative. Informally, if, when
printing the string, we attempt to force the typeball beyond
the left margin of the paper, the string is not balanced on
the left. In a similar way, we define a string to be balanced
on the right if all of its suffixes have nonnegative spacing.
Informally, a string is balanced on the right if its maximum
rightward movement is reached at the end of the string. We
call a string balanced if it is balanced on the left and on
the right.


Examples of strings unbalanced on the left are '←ABC' and
'AB←←←__'; such strings cannot generally be printed and are
almost certainly errors. Any interpretation short of
abnormally terminating the run will probably be an acceptable
one. Strings unbalanced on the right such as 'FOB←/' or 'ABC←'
are not errors and have well-defined meanings.


Let a character c which is neither USCORE nor BSPACE be
embedded in the string S as


         S = S1 c S2


Then the position number of c is defined as equal to the
spacing of S1 plus 1. We refer to the characters of S other
than USCORE and BSPACE as the position characters of S.

Let S be a string without USCORE's. Then the B-normalization
of S is defined as that string S' such that


1) S' is balanced.


2) The position numbers of the characters of S' are
   monotonically nondecreasing.


3) The position characters of S' are identical to the
   position characters of S and each such character retains
   its position number and, moreover, any pair of characters
   having identical position numbers retain their relative
   ordering in S' as they had in S.


As an immediate consequence of the definition, all position
numbers in the B-normalization of a string are nonnegative.
Hence, strings unbalanced on the left, having negative
position numbers, will not have a B-normal form. On the other
hand all strings balanced on the left have a unique
B-normalization which can be produced by construction. This
follows because items 1) and 2) assure us that S' is a
sequence of substrings each representing one print position
having the form:


         'c1←c2←...←cn'


where n ≥ 1 and in general varies with the print position. The
characters c1, c2, ..., cn each have the same position number.
Note that they all must retain their relative ordering. This
is done not merely to make B-normalization unique, but also
because we do not know the intended purpose of the
backspacing. Thus, c1←c2 is indistinguishable from c2←c1 when
printed but if we choose to interpret '←' as subscript or
superscript the ordering is important.


If S contains USCORE's the situation is complicated slightly.
What are we to make of


         'FOω←/RTRANω'

Should the overstruck 'O' be underscored along with 'RTRAN',
or should 'RTRAN' alone be underscored?

Obviously this is a mistake. The string to the right of 'ω'
should be balanced on the left so that the 'ω' is not shifted
to the right of characters which appeared after it. Similarly
the string to the left of 'ω' should be balanced on the right.
Hence we define the B-normalization S' of the string S where


         S = S1 ω S2

as

         S' = S1' ω S2'


where S1' and S2' are the B-normalized versions of S1 and S2,
respectively. Of course, S1 and S2 may either or both contain
USCORE's in which case the definition applies recursively.


Proposition 10.1


If any string S is balanced on the left, then REVERSE(S) is
balanced on the right. Conversely, if S is balanced on the
right, then REVERSE(S) is balanced on the left.


Proof: The proof is simple but instructive. If S is balanced
on the left then all prefixes of S have nonnegative spacing,
by definition. If P is a prefix of S then REVERSE(P) is a
suffix of REVERSE(S). Since the spacing of REVERSE(P) is the
same as the spacing of P, the spacing of the suffix is
nonnegative. Since all suffixes of REVERSE(S) correspond in
this way to some prefix of S, we conclude that REVERSE(S) is
balanced on the right. In a similar way we can prove the
converse.


Proposition 10.2


If S1 and S2 are right-balanced then S1 S2 is right-balanced.
Similarly if S1 and S2 are left-balanced then S1 S2 is
left-balanced.


Proof: Any suffix of S1 S2 is either a suffix of S2, in which
case its spacing is nonnegative, or is of the form F S2 where
F is a suffix of S1. But the spacing of F S2 = spacing F +
spacing S2 and hence is also nonnegative. Hence S1 S2 is
right-balanced. In a similar way S1 S2 is left-balanced.


Proposition 10.3

Every suffix of a right-balanced string is right-balanced.
Similarly every prefix of a left-balanced string is
left-balanced.

Proof: Obvious.


An algorithm to B-normalize a string S containing no USCORE's
is given below:


(i)   Reverse S.


(ii)  Apply the following transformation repeatedly until it
      can no longer be applied.


         S NOTANY(B) . X B B ONE_POS . Y = B Y X B


      (where B is the BSPACE character and where ONE_POS is a
      pattern which will match the shortest string whose
      spacing is 1).


(iii) Remove initial BSPACE's from S.

(iv)  Test for double BSPACE or trailing BSPACE. If yes to
      either question, the original string was not left-
      balanced; respond appropriately. Otherwise return the
      reverse of S.


To illustrate the algorithm, let S be the string
'abcd←←←←efgh'. By step (i) it is reversed to form
'hgfe←←←←dcba'. Step (ii) is a multistepped process
illustrated in Figure 10.1, yielding the string shown. Step
(iii) does nothing. Step (iv) reverses the string to return
'a←eb←fc←gd←h' which is the result sought.
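
Steps (i) through (iv) can be followed in a short Python
sketch. It is an illustration under stated assumptions and not
a transcription of BNORM: the character "\b" stands for
BSPACE, one_pos replaces the ONE_POS pattern by an explicit
scan for the first point at which the running spacing reaches
1, and a ValueError stands in for "respond appropriately". All
function names are invented for the sketch.

```python
B = "\b"  # stand-in for the BSPACE character


def one_pos(s, i):
    """Return the index just past the shortest substring of s that
    starts at i and has spacing 1, or None if there is none."""
    need = 1
    while need > 0:
        if i >= len(s):
            return None
        # A BSPACE lowers the running spacing by 1; any other
        # character raises it by 1.
        need += 1 if s[i] == B else -1
        i += 1
    return i


def transform_once(s):
    """Apply  X B B Y  ->  B Y X B  at the leftmost place it fits."""
    for i in range(len(s) - 2):
        if s[i] != B and s[i + 1] == B and s[i + 2] == B:
            j = one_pos(s, i + 3)
            if j is not None:
                x, y = s[i], s[i + 3:j]
                return s[:i] + B + y + x + B + s[j:]
    return None


def bnorm(s):
    """B-normalize s, which is assumed to contain no USCORE's."""
    s = s[::-1]                                  # step (i)
    while (t := transform_once(s)) is not None:  # step (ii)
        s = t
    s = s.lstrip(B)                              # step (iii)
    if B + B in s or s.endswith(B):              # step (iv)
        raise ValueError("string is not balanced on the left")
    return s[::-1]


result = bnorm("abcd\b\b\b\befgh")
print(result.replace(B, "<"))   # prints a<eb<fc<gd<h
```

On the example of the text the transformation is applied three
times, moving 'e', then 'f', then 'g' to the right of the
material each had overstruck.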


Step (ii) is the heart of the algorithm and does the
following. The spacing of (B B Y) is -1. Hence the position
number of X is higher than the position number of all
characters in Y. Since in B-normalization the position numbers
must be in ascending sequence, the X and the Y are
interchanged.
It is for this reason too that the transformation of (ii) must 
terminate since there are only a finite number of inversions 
in the original string. 


Will we be able to reverse all inversions? In order to have
an inversion we must have at least one double BSPACE. If the
double BSPACE is not removed by (ii) then it either is at the
beginning, in which case it is removed by (iii), or the
sequence


         NOTANY(B) B B


occurs in S but is not followed by ONE_POS. This implies that
S is not balanced on the right; the transformation indicated
in (ii) preserves right balancing (the proof of which is left
as an exercise) so this implies that the original reversed
string was not right-balanced. This implies by Proposition
10.1 that the original string S was not left-balanced.


The definition of ONE_POS can be given recursively as:

         ONE_POS = NOTANY(B) | B *ONE_POS *ONE_POS


This definition, while 'correct', could prove impractical. Let
us assume that 100 backspaces appear consecutively. Then
ONE_POS will descend to 100 levels before matching. Though
there is no inherent limitation on the number of recursive
levels to which we can plunge, there are often practical
limitations, and this will, in general, depend on the
implementation. Since the limit on the recursive depth has
been known to be less than 100 for some implementations and
since 100 consecutive backspaces, while unusually large, is
not an unreasonable quantity, we must seek a solution. We
solve our problem by scanning first for a group of BSPACE's
(viz. 5 of them) and only if the group is not there do we
choose to try the case of one BSPACE. Thus


         ONE_POS = NOTANY(B) |
+                  DUPL(B,5) FENCE *FIVE_POS *ONE_POS |
+                  B *ONE_POS *ONE_POS

         FIVE_POS = ONE_POS ONE_POS ONE_POS ONE_POS ONE_POS


The maximum recursive plunge becomes [k/5] + REMDR(k,5) where
k is the number of consecutive BSPACE's. If recursive levels
of 70 are permitted, we can tolerate k ≤ 338. We can use the
same basic scheme to achieve even longer runs of consecutive
BSPACE's but 338 should suffice.
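
The figure of 338 can be checked by evaluating the formula for
successive values of k. The short Python sketch below simply
takes the stated formula and the limit of 70 levels as given;
plunge, LIMIT and largest are names invented for the check.

```python
def plunge(k):
    """Maximum recursive depth for k consecutive BSPACE's,
    per the formula [k/5] + REMDR(k,5)."""
    return k // 5 + k % 5


LIMIT = 70
largest = 0
# Advance while the next run length still fits within the limit.
while plunge(largest + 1) <= LIMIT:
    largest += 1

print(largest, plunge(largest), plunge(largest + 1))
```

For comparison, plunge(100) is 20, against the 100 levels
required by the simple recursive definition.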


Note the effect of FENCE. If it were not there our clever 
scheme would be thwarted if a long sequence of BSPACE's ap- 
peared in a string which was unbalanced on the left. The 
reason is that, as we have discussed earlier, the right-most 
*ONE_POS will fail. Without the FENCE the alternate 
B *ONE_POS *ONE_POS will be tried. We will ultimately recurse 
as many levels as there are BSPACE's only it will take longer. 


*  BNORM(S) will return the B-normalization of the string S.
*  Blanks will be prepended to S if it is not balanced on the
*  left.
         DEFINE('BNORM(S)B,S1,S2,X,Y,P')
*  Initialize patterns
         ONE_POS = NOTANY(BSPACE)
+                | DUPL(BSPACE,5) FENCE *FIVE_POS *ONE_POS
+                | BSPACE *ONE_POS *ONE_POS
         FIVE_POS = ONE_POS ONE_POS ONE_POS ONE_POS ONE_POS
         IF_BSPACE = BREAK(BSPACE)                   :(BNORM_END)
*  Entry point: First make a quick scan to see if any
*  backspace character exists in S. If none such, return
*  immediately.
BNORM    S IF_BSPACE                                 :S(BNORM_1)
         BNORM = S                                   :(RETURN)
*  Are there any USCORE's? If so, subdivide and recurse.
BNORM_1  S BREAK(USCORE) . S1 USCORE REM . S2        :F(BNORM_B)
         BNORM = BNORM(S1) USCORE BNORM(S2)          :(RETURN)
*  Reverse the string and apply the transformation described
*  in the text.
BNORM_B  S = REVERSE(S)
         B = BSPACE
         P = NOTANY(B) . X B B ONE_POS . Y
BNORM_2  S P = B Y X B                               :S(BNORM_2)
*  The transformation has been applied as far as it will go.
*  Remove leading BSPACE's.
         S POS(0) SPAN(B) =
*  If a double BSPACE or trailing BSPACE remains, add a blank
*  to S and try again. Otherwise reverse and return.
         S B B                                       :S(BNORM_UNB)
         BNORM = REVERSE(S)
         BNORM POS(0) B                              :F(RETURN)
BNORM_UNB S = S ' '                                  :(BNORM_2)
BNORM_END

Names referenced    Name       Type         Where defined
by BNORM:           REVERSE    Function     Program 3.6
                    BSPACE *   Character
                    USCORE     Character


* indicates name is referenced in the initialization section.


Epilogue 


BNORM was written under the assumption that most paragraphs do 
not contain USCORE's or BSPACE's. Such paragraphs are handled 
as efficiently as possible. Other paragraphs are not treated 
as quickly as could be done. Specifically, patterns are not 
predefined where they could be. The scanning for the pattern
P could be replaced by a more elaborate process so that double
BSPACE's would be found rapidly via BREAKX. Similarly, the
double BSPACE check at the end could also be done more rapidly 
using BREAKX. Another improvement might be to handle the spe- 
cial case of 


     n-nonBSPACE's   n-BSPACE's   n-nonBSPACE's


by a variant of the BLEND operation. But such sequences are 
likely to be used in the case of underscoring so that the pre- 
processor would be expected to catch this special case. 


Given our assumptions, however, none of these changes seems
warranted, since, for seldom used code, we want to be guided
more by the desire to save program space (which is also worth
money) than execution time. If the ground rules change,
rewriting according to the above principles may be indicated.


Note that if S is not left-balanced, BNORM(S) returns a 
balanced string which is similar to S. An alternate approach 
would be to have BNORM fail. In the latter case, however, the 
calling subroutine would have to specify recovery operations. 
This can become a continuing nuisance and can be all the more 
irritating because it involves a case which probably will 
never occur. 
Program 10.2 - INORM


Image Normalization, or I-normalization, is the process of
converting a string having a given printed image into a unique
representation for that image. Thus, the strings 'O←/' and
'/←O' (← denotes BSPACE), when printed, will have identical
printed images, viz. 'Ø'. Also, the image produced by 'X← '
is the same as the image produced by simply 'X', implying that
overstruck blanks may be dropped in I-normalization. The
reason for I-normal form is to be able to determine equality
of printed images based on the characters used to produce the
images. In addition, we would also like to scan a string which
produces an image to determine whether a subimage appears
within it. For example, suppose, in a time-sharing system, a
programmer had typed in the phrase:


     '... such a string is called a convoluted rope.'


and he wishes to change something in the string. Most
time-sharing systems have editors in which one can specify a
substring to be searched for and a replacement to be made, so
that the user could say in effect


     change 'rope' to 'string'


Assuming that USCORE is not being used and that no
normalization exists, the above substitution request could
result in the string


Since 'rope' has fewer characters than ‘'string', the under- 
lining is no longer correct. To compensate, we may request 
the editor to 


     change 'rope←←←←____' to 'string←←←←←←______'


We may obtain the desired result, but then again we may not.
If, in the original, we had typed 'rope' before underscoring
'convoluted', this particular string sequence would not be
found. Moreover, if we had typed the period before
underscoring 'rope' we also could not make the indicated
replacement. If, in the latter case, we made so simple a
request as


change '.' to 't 
we might obtain 
"... such a string is called a_convoluted_rope" 
This state of affairs can be quite frustrating, especially 
when repeated attempts to make replacements result in failure. 


Image normalization will permit us to escape from this 
malaise. 


Earlier we mentioned that B-normalization is a necessary
prelude to I-normalization. That this is true is a derivable
result.


By an image we mean a configuration of printing on paper, 1
character high and 0 or more characters wide. We may speak of
concatenating images just as we concatenate strings. Let the
image I be produced by each of the set of strings S1, S2, ...
where the sequence goes on indefinitely because there is no
limit to the number of backspaced blanks that can be added
without changing the image. Let N(S) be the function which
converts a string to its I-normal form. If N(S) is working as
it should then N(S1), N(S2), ... will all produce the same
string. Hence we can meaningfully speak of N(I) where I is an
image. The value of N(I) will be N(S) where S is any of the
strings which produce I. If, for example, N('O←/') happens to
be '/←O', we may say that N('Ø') equals '/←O'.


Our intended purpose is to be able to scan a given image I for 
a subimage I' by scanning N(I) for N(I'). This implies that 


     N(I1 I2) = N(I1) N(I2)


that is, the function must be homomorphic (with respect to 
concatenation of images). This is important because it means 
that the function N() is completely specified by a knowledge 
of N(I) where I ranges through all single print-position 
images. (See Chapter 3 for a further discussion of homomorphic 
functions.) 


The notion of normal form implies that the thing considered
'normal' is actually a member of the class it represents. That
is, if S1, S2, ... is the set of strings corresponding to
image I then


     N(I) = Sn


for some n. If, moreover, we make the normal form irredundant
in the sense that no characters can be removed without
changing the image, we are left with the conclusion that the
normal form of, for example, an underscored A can either be
'A←_' or '_←A', but nothing else. Hence, the mapping of a
single position must be of the form


     C1←C2← ... ←Cn


where n ≥ 1. This observation coupled with the fact that N()
must be homomorphic implies that a string in I-normal form
must also be in B-normal form.
The order of striking is unimportant in the final image
produced. For example, can the reader determine which
character was struck first in the set of overstrikes below?


     [row of overstruck characters]


The answer (although not obvious) is that the slash appeared
first at positions 1, 2 and 4.


The question of which images are distinguishable is an impor- 
tant one but, unfortunately, is one which depends on the 
equipment used and, to a certain extent, on the discriminating 
powers of the individual. Will, for example, a character
overstruck with itself produce a different image than if it
were not so overstruck? Is, for example, 'A' different from
'A←A'? We will hold that it is and that use can be made of the
resulting boldface. However, not all media are like printers
in this respect. The all-or-none characteristic of cathode
ray displays may prohibit this assumption. Also, some
time-shared editors (e.g. Saltzer [1964]) have been known to
normalize away boldface.


Another source of ambiguity is that different overstruck com- 
binations can resemble each other. For example 


     [three overstruck images]

were produced respectively by the combinations

     [three overstrike sequences]


Though they can be distinguished when compared, they may not 
be so distinguishable if viewed in isolation. 


Another issue is the non-printable character. As mentioned 
earlier (Chapter 2), most of the 256 EBCDIC characters are 
non-printing. To be consistent with the previous notions of 
image identity, each of these should be converted to blank. 
This we will not do for 2 reasons. Experience has shown that 
use can be made of a character that prints blank but which 
really isn't a blank for the purpose of line breaking and pad- 
ding (so-called hard blanks). Also, the notion of nonprinting 
character is device dependent. The subscripts (such as ',') 
are non-printing on most printers (and most devices) but 
should not be converted to blank each time they appear in 
text. A program is usually not dedicated to a particular
device and in fact may be in simultaneous communication with
2 different devices. In such cases, the notion of non-printing
character loses its significance.


As a result of these considerations, we will assume a string
S1 of overstruck characters can be distinguished from a string
S2 if and only if


     ORDER(DIFF(S1,' ')) ≠ ORDER(DIFF(S2,' '))


(See Progs. 3.10 and 3.1). This leads to the following
definition. A string is in I-normal form if


(1) it is in B-normal form, and 


(2) for every sequence of the form


     C1←C2← ... ←Cn


where n>1, the characters are in alphabetic order and contain
no blanks.


A string can be I-normalized by placing it in B-normal form,
removing overstruck blanks, and alphabetizing overstruck
characters as is shown below.
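
The two conditions above can be illustrated with a short sketch. It is written in Python rather than SNOBOL4, it is not the book's INORM, and it assumes its argument is already in B-normal form. The name inorm_bnormal and the use of "\b" for BSPACE are assumptions made for the example, and Python's sorted() orders by code point rather than by the EBCDIC collating sequence the book has in mind.

```python
BSPACE = "\b"  # assumption: backspace stands in for the book's BSPACE


def inorm_bnormal(s: str) -> str:
    """I-normalize a string that is already in B-normal form."""
    out = []
    i = 0
    while i < len(s):
        # Collect one print position: c1 BSPACE c2 BSPACE ... cn
        chars = [s[i]]
        i += 1
        while i + 1 < len(s) and s[i] == BSPACE:
            chars.append(s[i + 1])
            i += 2
        # Drop overstruck blanks; a position holding only blanks
        # remains a single blank.  Repeated characters are kept,
        # so boldface ('A' struck twice) survives.
        kept = sorted(c for c in chars if c != " ") or [" "]
        out.append(BSPACE.join(kept))
    return "".join(out)
```

Under these assumptions both inorm_bnormal("O\b/") and inorm_bnormal("/\bO") give the same string, and inorm_bnormal("X\b ") reduces to "X".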


*  INORM(S) will return the Image Normalization of the string
*  S.
*
         DEFINE('INORM(S)C,CC,S1,K')
*
*  Initialize patterns. PR_POS will find a print position
*  containing backspaces.
*
         PR_POS  =  POS(0) ARB . S1 (LEN(1) BSPACE LEN(1)
+           ARBNO(BSPACE LEN(1))) . CC (NOTANY(BSPACE) | RPOS(0)) . C
                                                  :(INORM_END)
*
*  Entry Point: If no BSPACE's are present, return
*  immediately. Otherwise B-normalize S before going further.
*
INORM    S  IF_BSPACE                             :F(INORM_RET)
         S  =  BNORM(S)
*
*  Look for a print position involving BSPACE. If none are
*  left, return. Otherwise, ORDER the overstruck characters.
*
INORM_LOOP
         S  PR_POS  =  C                          :F(INORM_RET)
         CC  =  DIFF(CC,BSPACE ' ')
         CC  =  IDENT(CC,NULL) ' '
         CC  =  BLEND(ORDER(CC), DUPL(BSPACE,SIZE(CC) - 1))
         INORM  =  INORM S1 CC                    :(INORM_LOOP)
*
*  Common return point.
*
INORM_RET  INORM  =  INORM S                      :(RETURN)
INORM_END

Names referenced    Name         Type         Where defined
by INORM:           BNORM        Function     Program 10.1
                    IF_BSPACE    Pattern      Program 10.1
                    ORDER        Function     Program 3.1
                    BLEND        Function     Program 3.7
                    DIFF         Function     Program 3.10
                    BSPACE *     Character


* indicates name is referenced in the initialization section. 
Epilogue


Here, as in BNORM, we adopt the view that while it is essen- 
tial to handle the case of no backspace characters rapidly, we 
can take our time with strings in which they are present. In 
particular, if no special characters exist in the argument S, 
control passes to INORM_RET where an exit is made. It seems 
as if an unnecessary concatenation is performed at INORM_RET 
but the system is smart enough to return the other argument if 
one of them is null. 


If the assumption that BSPACE's are rare is invalid there are
several ways of speeding up INORM. One method would be to
rewrite PR_POS so that BREAK is used rather than ARB to search 
for a BSPACE. The writing of PR_POS is complicated by the fact 
that BREAK carries one further than where one might like to be 
but this can be handled by failing and alternating. See Exer- 
cise 8.5. 


Another method of speedup works on the fact that the great
majority of overstruck positions have only 2 characters at
that position. Handling of this as a special case can avoid
the call to ORDER most of the time.


Program 10.3 - LINE


Given a paragraph stored as one long string, we will need a
function to separate the paragraph into lines. LINE(CW) will
return the next cluster of words which will just fit within a
column width of size CW. To initialize LINE a call is made to
LINE_INIT(P) where P is the paragraph to be decomposed. When
LINE(CW) fails no more characters remain. Thus


         LINE_INIT('A QUICK BROWN FOX JUMPED OVER THE LAZY DOG.')
L        OUTPUT  =  "'" LINE(10) "'"              :S(L)


will print


     'A QUICK'
     'BROWN FOX'
     'JUMPED'
     'OVER THE'
     'LAZY DOG.'


If the global variable JUSTIFY is given the value 1 then the 
right margin is justified. Thus if 


JUSTIFY = 1 


had been executed prior to the calls to LINE(10) the values
printed would have been:


     'A    QUICK'
     'BROWN  FOX'
     'JUMPED'
     'OVER   THE'
     'LAZY DOG.'


Here, JUSTIFY serves as a switch and follows the same
conventions as SNOBOL4 keyword switches (i.e. an integer not
equal to 0 is on; an integer equal to 0 or null is off). No attempt
is made to justify the last line or a line in which no spaces 
appear. 


In general, justifying text of small line widths suffers from 
the possibility of words exceeding the column width and single 
word-lines (such as 'JUMPED') not meeting it. These ill ef- 
fects diminish in significance as the column width increases. 
Hyphenation (Program 10.7) also helps in this regard to 
produce a document with less white area. 


Breaking a line at a suitable break point must seem like sheer 
simplicity. If the column width is CW, then go out to that 
position + 1 and start marching backward until a blank is 
found. This should be our breakpoint. But this doesn't always 
work for several reasons. It won't work if we allow the pos- 
sibility of USCORE's and BSPACE's. Consider the example 


"A WQUICK BRO-/WNA FO+/X! 


If the column width is 15, the first 3 words will easily fit 
within a column, but the above algorithm will pick up only the 
first two. This is because the spacing of a string may be less 
than its size. 


Another reason that we cannot use the simple algorithm is that 
a string may be reduced in size by contracting certain sub- 
strings such as converting double blanks to single blanks. 
Such a condensation will, in general, be preferable to adding
a large number of blanks into the line. In order that
this technique be effective we must include in our considera- 
tion enough of the paragraph in order to take advantage of any 
conceivable condensation. 


A third reason has to do with hyphenation. Hyphenation al- 
gorithms are not very good unless the entire word to be 
hyphenated is available. 


In all of these cases we need to have sufficient context in
order to make an intelligent decision as to how to break a
line.


Another difficulty has to do with the assumption that all
blanks separate words. Consider the string


     'A QUICK BROW←←/ N FOX'


Here a blank is used to get over the 'W' and not to end a
word. But we may convert the string to B-normal form to obtain


     'A QUICK BRO←/W← N FOX'


From any string we may safely remove either of the
combinations '← ' or ' ←' without changing the image printed.
Moreover, by making such deletions from the B-normal form we
will remove all overstruck blanks. Any remaining blanks will
be regarded as true word separators.


There are cases when a user does not wish to have a blank
treated as a word separator. (There are some examples of this 
in the preceding paragraph.) In such instances the user of 
the system may inject into his text so-called hard blanks. 
These are any nonprintable character other than blank. As an 
example, the 0-8-2 punch provides the 029 keypunch user with 
such a hard blank. For input devices which do not have a spe- 
cial key for this purpose, the system can provide a special 
character which will be appropriately converted. 


The contractions which should be permitted in a line of text 
will vary with the application, taste and perhaps with the 
column width. Almost certainly, we should be permitted the 
freedom to convert the two blanks which normally separate 
sentences into one blank. Often we may condense strings of 
the form


     punctuation-mark blank


by removing the blank. For example


     'A quick, brown, angry fox ...'


could also be rendered


     'A quick,brown,angry fox ...'


We can associate with each string S a minimum printing width
MINP(S) which is equal to SPACING(S') where S' equals S after
all allowable contractions have been made. Then


     MINP(S) ≤ SPACING(S) ≤ SIZE(S)


We define a natural break point as the SIZE of a prefix which
ends in a nonblank which immediately precedes a blank. Thus,
the natural break points of

"A wquick, brown, angry foxv jumped ...! 

are


     1  9  16  22  27  34  ...


Associated with each breakpoint is a spacing. For the above
example, the spacings are:


     1  8  15  21  26  32  ...
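
To make the two definitions concrete, here is a small sketch in Python (not SNOBOL4, and not the book's code). It lists the natural break points of a string and the spacing of each corresponding prefix, using the spacing rule of Program 10.5 (size less 2 per BSPACE and 1 per USCORE). The control-character stand-ins for BSPACE and USCORE, the function names and the sample text are all assumptions chosen for the example.

```python
BSPACE = "\b"    # assumption: stand-in for the book's BSPACE
USCORE = "\x1f"  # assumption: arbitrary stand-in for the book's USCORE


def spacing(s: str) -> int:
    """Size less 2 per BSPACE and 1 per USCORE (rule of Program 10.5)."""
    return len(s) - 2 * s.count(BSPACE) - s.count(USCORE)


def natural_break_points(s: str) -> list:
    """Sizes of the prefixes that end in a nonblank followed by a blank."""
    return [i for i in range(1, len(s)) if s[i] == " " and s[i - 1] != " "]


# The O in BROWN is overstruck with a slash.
text = "A QUICK BRO\b/WN FOX JUMPED"
points = natural_break_points(text)          # [1, 7, 15, 19]
widths = [spacing(text[:b]) for b in points]  # [1, 7, 13, 17]
```

The third break point has size 15 but spacing 13, because the overstrike costs two characters that occupy no extra print position.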
Clearly, if a spacing exists such that it exactly equals CW,
there is no problem. Sufficient context is defined as the
break-point associated with the smallest spacing equal to or
greater than CW. Denote this break-point B2, and denote its
predecessor B1. Denote the associated spacings (or widths) W1
and W2. Then


     W1 < CW ≤ W2


Denote the associated prefixes X1 and X2. Then


     SIZE(X1) = B1
     SIZE(X2) = B2


Without hyphenation we have 2 choices, either to expand X1 by
inserting blanks or to squeeze X2. We will assume that the
aesthetic liability (termed Ugly Factor (UF) in the program)
associated with inserting a blank is equal to that associated
with removing a blank (exercises will explore other less
simplistic possibilities). Hence we seek the minimum of


     W2 - CW   and   CW - W1


Of course, if it is not physically possible to shrink X2 to
size, we must use X1.
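
The choice just described, ignoring hyphenation, can be sketched as follows. This is Python, not the book's SNOBOL4. The function name choose_break and the parameter minp2 (standing for the minimum printing width of the longer prefix) are hypothetical, and resolving a tie in favour of expansion is an assumption read off the GE test in the listing of LINE rather than something the text states.

```python
def choose_break(w1: int, w2: int, cw: int, minp2: int) -> int:
    """Return 1 to expand the shorter prefix, 2 to squeeze the longer one.

    w1, w2 : spacings of the two candidate prefixes (w1 < cw <= w2)
    cw     : column width
    minp2  : minimum printing width of the longer prefix
    """
    if minp2 > cw:
        return 1              # the longer prefix cannot be shrunk to size
    expand_cost = cw - w1     # blanks that would have to be inserted
    squeeze_cost = w2 - cw    # blanks that would have to be removed
    # Assumption: a tie is resolved in favour of expansion.
    return 2 if squeeze_cost < expand_cost else 1
```

For the 'A QUICK BROWN FOX ...' example with a column width of 10, the candidates have spacings 7 and 13 and the longer one cannot be contracted, so the shorter prefix is used.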


If hyphenation is available, we consider each hyphenation 
point in turn and seek to minimize the contraction or expan- 
sion necessary. Also we add an additional cost (of 1) for the 
aesthetic loss due to hyphenation. 


The algorithm to obtain sufficient context (B2) is simply to
look at break-points at CW, CW+1, CW+2, etc. and keep looping
until a spacing is found greater than or equal to CW. Since 
the spacing is less than or equal to the break-point, no 
break-point below CW is needed. To find a break-point at CW, 
however, it is necessary to look for blanks beginning at CW-1. 
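
A simplified sketch of the search for sufficient context follows, in Python rather than SNOBOL4. Unlike the book's method, which starts its scan near CW, this version walks every natural break point from the left so that it can also report the predecessor break point. The function name and the default spacing function (plain length, valid only when no BSPACE or USCORE is present) are assumptions.

```python
def sufficient_context(p: str, cw: int, spacing=len):
    """Return (b1, b2) for paragraph p and column width cw.

    b2 is the first natural break point whose prefix has a spacing
    of at least cw, and b1 is the break point before it (0 if none).
    b2 is None when no such break point exists, i.e. when whatever
    remains is small enough to fit in a line.
    """
    b1 = 0
    for i in range(1, len(p)):
        if p[i] == " " and p[i - 1] != " ":
            if spacing(p[:i]) >= cw:
                return b1, i
            b1 = i
    return b1, None
```

With the paragraph 'A QUICK BROWN FOX JUMPED' and a column width of 10 this yields the break points 7 and 13, the two candidates between which the line-breaking decision is made.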


*  LINE(CW) will return the next line of a paragraph passed
*  to LINE_INIT(). The column width is CW characters. LINE
*  will fail when no more lines remain. If HYPHENATE is non-
*  zero, words will be hyphenated. If JUSTIFY is nonzero the
*  lines will be right-justified (padded with blanks).
*
         DEFINE('LINE(CW)B,B2,TRY,X2,W,W2,T,RWORD,UF,UF1,'
+           'K,H,HYPHEN')
         HYPHENATE  =  1
         JUSTIFY    =  1
         DEFINE('LINE_INIT(P)T')
         &ALPHABET  LEN(1) . HARD_BLANK
                                                  :(LINE_INIT_END)
*
*  Entry point for initialization: B-normalize the paragraph
*  and remove any overstruck blanks from P.
*
LINE_INIT  P  IF_BSPACE                           :F(LINE_I1)
         P  =  BNORM(P)
LINE_I2  P  BSPACE ' '  =                         :S(LINE_I2)
LINE_I3  P  ' ' BSPACE  =                         :S(LINE_I3)
*
*  Replace leading blanks (if any) by 'hard blanks' (i.e.
*  blanks not subject to reduction or expansion). Append a
*  blank to make scanning easier. U_SAVED contains an under-
*  score if there was an unterminated underscoring left over
*  from the last line.
*
LINE_I1  P  POS(0) SPAN(' ') @T  =  DUPL(HARD_BLANK,T)
         P_SAVED  =  P ' '
         U_SAVED  =                               :(RETURN)
LINE_INIT_END
*
*  Initialize patterns for LINE.
*
         SUFFICIENT_CONTEXT.X2  =  (LEN(*TRY) BREAK(' ')) . X2
+           @B2 SPAN(' ') @TRY
         FIND.RWORD.T  =  @T BREAK(' ') . RWORD SPAN(' ') @T
         EXTRACT.LINE  =  LEN(*B) . LINE (SPAN(' ') | NULL)
         IF_USCORE     =  BREAK(USCORE)
                                                  :(LINE_END)
*
*  Entry point proper: Obtain sufficient context (B2, X2).
*  If a sufficient context does not exist, go to LINE_SMALL.
*  Keep looping back until a sufficient context is obtained
*  or is determined not to exist. If the spacing, W2, exactly
*  equals CW, this is the desired breakpoint, B.
*
LINE     TRY  =  CW - 1
LINE_1   P_SAVED  SUFFICIENT_CONTEXT.X2           :F(LINE_SMALL)
         W2  =  SPACING(X2)
         GE(W2,CW)                                :F(LINE_1)
         B  =  EQ(W2,CW) B2                       :S(LINE_2)
*
*  Find the last word RWORD in reversed form from X2. From
*  the breakpoint T, compute a tentative breakpoint B (this
*  is actually B1) and a tentative ugly factor UF (the amount
*  by which X1 must be expanded).
*
         REVERSE(X2)  FIND.RWORD.T
         B  =  B2 - T
         UF  =  CW - SPACING(SUBSTR(X2,1,B))
*
*  Starting with no hyphenation (K=0) and looping for
*  increasing degrees of hyphenation, determine a) if the
*  line will fit and b) if the cost of padding plus hyphena-
*  tion (UF1) is less than the lowest so far achieved. W is
*  the spacing of the reduced line.
*
         K  =  0
LINE_3   LE(MINP(X2) - K + SIZE(HYPHEN), CW)      :F(LINE_4)
         W  =  W2 - K + SIZE(HYPHEN)
         UF1  =  CW - W
         UF1  =  LT(UF1,0) -UF1
         UF1  =  UF1 + SIZE(HYPHEN)
         GE(UF1,UF)                               :S(LINE_4)
         B  =  B2 - K
         UF  =  UF1
         H  =  HYPHEN
LINE_4   K  =  NE(HYPHENATE,0) HYPHENATE(RWORD,K + 1)
+                                                 :S(LINE_3)
*
*  Enter here with B set to break point and with H set to
*  null or '-'.
*
LINE_2   P_SAVED  EXTRACT.LINE  =
         LINE  =  LINE H
         LINE  =  NE(JUSTIFY,0) PAD(LINE,CW)
*
*  If an odd number of USCORE characters appear in LINE, set
*  the value of U_SAVED to USCORE to be tacked onto the next
*  line.
*
LINE_USCORE
         LINE  =  U_SAVED LINE
         LINE  IF_USCORE                          :F(RETURN)
         U_SAVED  =  DUPL(USCORE,
+           REMDR(COUNT(LINE,USCORE),2))
         LINE  =  LINE U_SAVED                    :(RETURN)
*
*  Entering here means that whatever remains is small enough
*  to fit in a line. If nothing remains, FAIL.
*
LINE_SMALL
         IDENT(P_SAVED,NULL)                      :S(FRETURN)
         LINE  =  TRIM(P_SAVED)
         P_SAVED  =                               :(LINE_USCORE)
LINE_END


Names referenced    Name         Type         Where defined
by LINE:            REVERSE      Function     Program 3.6
                    PAD          Function     Program 10.4
                    SUBSTR       Function     Program
                    MINP         Function     Program 10.6
                    BNORM        Function     Program 10.1
                    IF_BSPACE    Pattern      Program 10.1
                    HYPHENATE    Function     Program 10.7
                    USCORE *     Character
                    BSPACE       Character


* indicates name is referenced in the initialization section.
Program 10.4 - PAD


PAD(S,CW) will add or delete blanks from the string S as
necessary to adjust the spacing of S to equal CW. When blanks
are added they are not always added from the same direction.
Otherwise the process would tend to produce more white
area on one side as opposed to the other. White areas running 
vertically down the page are termed rivers and large bodies of 
white areas are termed lakes. It is good formatting practice 
to prevent rivers and lakes from forming. 


The writing of PAD is greatly simplified by the assumption 
that S is B-normalized and contains no overstruck blanks (a 
fact assured by the activity in LINE_INIT). This implies that 
every blank separates 2 balanced substrings and so blanks may 
be inserted without causing misalignment of overstruck 
characters.
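
The idea of alternating the padding direction can be sketched briefly. The code below is Python, not the book's SNOBOL4 PAD. It assumes the line contains no BSPACE or USCORE (so its spacing equals its length) and no leading or trailing blanks, it omits PAD's first step of widening a separator, and the function name and from_right flag are assumptions playing the role of PAD_RT.

```python
import re


def pad(s: str, cw: int, from_right: bool = False) -> str:
    """Widen interior gaps of s, one blank at a time, until len(s) == cw."""
    n = cw - len(s)
    parts = re.split(r"( +)", s)        # words at even indices, gaps at odd
    gaps = list(range(1, len(parts), 2))
    if n <= 0 or not gaps:
        return s                        # nothing to add, or no interior blank
    if from_right:
        gaps.reverse()
    for k in range(n):
        # Visit the gaps in turn, wrapping around if n exceeds their number.
        parts[gaps[k % len(gaps)]] += " "
    return "".join(parts)


# A caller would flip the direction after each line, as PAD_RT does:
lines = ["A B C", "D E F"]
padded = [pad(line, 6, from_right=(i % 2 == 1)) for i, line in enumerate(lines)]
```

Alternating the starting side keeps the extra blank from landing in the same column on every line, which is how rivers form.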


*  PAD(S,CW) will add or delete blanks to the string S to
*  make it conform to a column width of CW.
*
         DEFINE('PAD(S,CW)I,K,T,N')
*
*  This pattern looks for the first blank which is not in a
*  sequence of initial blanks.
*
         INTERIOR_BK  =  ((SPAN(' ') | NULL) FENCE BREAK(' ')) . T
                                                  :(PAD_END)
*
*  Entry point: Determine the number of blanks (N) to be ad-
*  ded. Branch to PAD_REDUCE if N < 0.
*
PAD      N  =  CW - SPACING(S)
         PAD  =  LE(N,0) S                        :S(PAD_REDUCE)
*
*  First insert a blank at a statement separator if any
*
         S  '  '  =  '   '                        :F(PAD_1)
         N  =  N - 1
         PAD  =  EQ(N,0) S                        :S(RETURN)
*
*  PAD_RT is a flag to indicate whether padding should begin
*  from the right (=1) or from the left (=0).
*
PAD_1    S  =  EQ(PAD_RT,1) REVERSE(S)
*
*  Inner loop: Remove a prefix from S at an internal blank.
*  Place it onto PAD with an extra blank. Keep looping until
*  N is reduced to 0.
*
PAD_LOOP S  INTERIOR_BK  =                        :F(PAD_AGAIN)
         PAD  =  PAD T ' '
         N  =  N - 1  GT(N,1)                     :S(PAD_LOOP)
*
*  Falling through indicates completion. Append S; reverse
*  if necessary; change flag for next time; and return.
*
PAD_DONE PAD  =  PAD S
         PAD  =  EQ(PAD_RT,1) REVERSE(PAD)
         PAD_RT  =  1 - PAD_RT                    :(RETURN)
*
*  Here if no more holes remain. If PAD is null at this point
*  return; there are no holes. Otherwise restore PAD and S.
*
PAD_AGAIN  IDENT(PAD)                             :S(PAD_DONE)
         S  =  PAD S
         PAD  =                                   :(PAD_LOOP)
*
*  Here to remove N characters.
*
PAD_REDUCE  N  =  LT(N,0) N + 1                   :F(RETURN)
         PAD  '  '  =  ' '                        :(PAD_REDUCE)
PAD_END 
Names referenced    Name         Type         Where defined
by PAD:             SPACING      Function     Program 10.5
                    REVERSE      Function     Program 3.6


Epilogue 


The design of PAD was based on the assumption that N is small 
compared with the size of S and indeed that N does not usually 
exceed the number of blanks in S. If this were not the case 
then a more efficient procedure would be to make one pass 
through to determine the number of blanks in S, compute the 
number of blanks to be inserted and, in this way, accomplish 
the insertion in 2 passes. 


The method given saves the initial pass of counting the number 
of blanks in S and is very much more efficient when 0, 1 or 2 
blanks are to be inserted in S. 


Program 10.5 - SPACING


SPACING(S) will determine the spacing of the string S. If S
has been B-normalized this will yield the number of print
positions occupied by the string.


*  SPACING(S) will return the spacing of the string S.
*
         DEFINE('SPACING(S)')
         IF_OVERSTRIKE  =  BREAK(BSPACE USCORE)
                                                  :(SPACING_END)
*
*  If no special characters exist, just return the number of
*  characters in S.
*
SPACING  SPACING  =  SIZE(S)
         S  IF_OVERSTRIKE                         :F(RETURN)
*
*  Otherwise deduct 2 for each backspace and one for each
*  underscore.
*
         SPACING  =  SPACING - 2 * COUNT(S,BSPACE)
+           - COUNT(S,USCORE)                     :(RETURN)
SPACING_END
Names referenced    Name         Type         Where defined
by SPACING:         COUNT        Function     Program 3.4
                    BSPACE *     Character
                    USCORE *     Character


* indicates name is referenced in the initialization section. 


Epilogue


The two calls to COUNT do not render the most efficient coding
but the convenience and the fact that overstrike characters
are relatively rare suggest its use.


Program 10.6 - MINP


MINP(S) will return the minimum number of print positions
needed to print the string S.


         DEFINE('MINP(S)T')
                                                  :(MINP_END)
*
*  Entry point: if JUSTIFY is 0, the contraction points are
*  ignored. Just return SPACING in this case.
*
MINP     MINP  =  SPACING(S)
         EQ(JUSTIFY,0)                            :S(RETURN)


*
*  Reduce MINP by one for each contraction point found.
*
         MINP  =  MINP - COUNT(S,'  ')            :(RETURN)
MINP_END 
Names referenced    Name         Type         Where defined
by MINP:            SPACING      Function     Program 10.5
                    COUNT        Function     Program 3.4
                    JUSTIFY      Global Flag


Program 10.7 - HYPHENATE


Hyphenation, while not strictly necessary, serves to eliminate
rivers and lakes in documents with right edge alignment. This
is particularly true with small column widths in which the
same amount of expansion is concentrated in relatively few
gaps. An exact algorithm for hyphenating words does not exist
short of storing large numbers of special cases. In the
extreme, a complete dictionary could be stored but such a
massive amount of information would have to be placed on
secondary storage since it would be uneconomical, if not
impractical, to store the dictionary in high-speed storage.
But secondary storage is unsuitable to this problem since
accesses must be made frequently (almost once per line).


The algorithm we will present will not depend on dictionary
methods other than that a relatively small number of suffixes
must be stored. Its error rate is low but not zero.
Fortunately, no great tragedy befalls if an occasional word is
mishyphenated. In the last analysis it becomes a balance of
aesthetics. How many lakes and rivers are worth how many
mishyphenated words?


Perhaps the simplest published hyphenation algorithm appears 
in Rich and Stone [1965]. The basic method involves examining 
pairs of letters out of context and deciding whether this pair 
is or is not suitable for hyphenation. This algorithm turns 
out to be too weak (not enough break points are discovered) if 
too few letter pairs are permitted, or too erroneous 
(producing a break at a non-syllable boundary) if too many 
letter pairs are dubbed as breakable. Letter pairs do not
hyphenate uniformly enough to be used as a sole guide for
hyphenation.


The program given here is based on an algorithm developed by 
M.R. (Molly) Wagner [1971] for incorporation in a text
formatting program called Roff [McIlroy 1971]. Wagner extended Rich
and Stone's work to include an examination of suffixes before 
looking for letter pairs and also greatly reduced the number 
of letter pairs considered breakable. With these improvements, 
the error rate has been reduced to the neighborhood of 1% and 
the number of hyphenation points found, while far from total, 
is nonetheless satisfactory. This book uses the hyphenation 
algorithm described, with the proviso that the user can over- 
ride the automatic hyphenation of specific words. Very few 
overrides were required. 


Most hyphenations found are by suffix removal. Three distinct
kinds of suffixes are defined. A hyphenating suffix is one
before which one can hyphenate. For example, 'less' and 'ness'
are both hyphenating suffixes. If 'carelessness' is to be
hyphenated with room for only 6 characters, the 'ness' is
stripped off first. There are still too many characters and
so the 'less' is stripped off. The word is then hyphenated as
'care-' on one line followed by 'lessness' on the next. An
inhibiting suffix is one which is not hyphenated and,
moreover, upon encountering one, the suffix hunt is given up
and letter-pair (or digram) testing ensues. For example, 'ing'
is an inhibiting suffix. If it is detected, as in 'winning',
the suffix is stripped and digram testing begins with the
double-n. This digram is breakable so that the word is
hyphenated 'win-ning'. Also, an inhibiting suffix will
absolutely prohibit hyphenating at a point where digrams might
indicate that hyphenation is allowed. Otherwise 'else' might
be hyphenated 'el-se'. A neutral suffix is one which is not
hyphenatable but, unlike the inhibiting suffix, does not
signal the start of digram testing. More suffix removal can
take place. For example, 'es' is a neutral suffix. In
'harnesses' the 'es' is stripped and a further suffix search
yields 'ness' as a hyphenating suffix. The word can therefore
be hyphenated as 'har-nesses'.
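
The suffix phase can be illustrated with a short sketch. It is
written in Python purely for illustration (the program itself is
SNOBOL4), the suffix lists hold only the examples quoted above,
and the name suffix_breaks is an assumption of this sketch. The
vowel checks and the digram phase are left out.

```python
# Simplified suffix lists: only the examples quoted in the text.
HYPHENATING = ("less", "ness")   # a break is allowed just before these
NEUTRAL = ("es",)                # stripped silently; the suffix hunt continues
INHIBITING = ("ing",)            # stripped; the suffix hunt stops here


def ending(stem, suffixes):
    """Return the first suffix in `suffixes` that ends `stem`, or None."""
    for suffix in suffixes:
        if stem.endswith(suffix):
            return suffix
    return None


def suffix_breaks(word):
    """Return break positions found by suffix removal, rightmost first.

    A position p means the word may be split as word[:p] + '-' + word[p:].
    """
    breaks = []
    end = len(word)
    while True:
        stem = word[:end]
        hit = ending(stem, HYPHENATING)
        if hit is not None:
            end -= len(hit)
            breaks.append(end)       # hyphenating suffix: record a break
            continue
        hit = ending(stem, NEUTRAL)
        if hit is not None:
            end -= len(hit)          # neutral suffix: no break, keep hunting
            continue
        # Inhibiting suffix or no suffix at all: the suffix hunt is over
        # (digram testing would begin here in the real program).
        return breaks
```

For 'carelessness' the sketch reports the breaks for
'careless-ness' and 'care-lessness' in that order, so a caller
can take the first one that leaves the line short enough.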


The second phase is digram testing. Here we find the
interesting phenomenon that most letter-pairs are considered
hyphenatable whereas most pairs of letters that actually
appear within English text are not. For example, every digram
of the form consonant-vowel is non-separable unless the
consonant is 'x'. Also every digram of the form
vowel-consonant is non-separable unless the consonant is 'q'. But
these pairs so predominate in English that it is not hard to
find words in which no breakable digram appears; 'hyphenate'
itself is one such word.
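
The digram rule can be checked with a small Python sketch. The
table below is transcribed from the DIGRAMS string of the
program as best it can be read from the listing, with '~'
standing in for the complement sign; individual entries should
be treated as assumptions. The helper names are hypothetical.

```python
import string

VOWELS = "AEIOU"

# Breakable pairs, keyed by the SECOND letter of the pair.
# '@' = the vowels, '~' = complement of the parenthesized set.
DIGRAMS = (
    "XA,~(@)B,~(@NS)C,~(@R)D,XE,~(@)F,~(@N)G,~(@CGPSTW)H,XI,"
    "~(@)J,~(@CLNS)K,~(@BCFGPTY)L,~(@Y)M,~(@GKSY)N,(AX)O,"
    "~(@SY)P,~(S)Q,(JKLMNRSVXZ)R,~(@KLNWY)S,~(@FHSY)T,XU,"
    "~(@)V,~(@S)W,~(@)X,(QWXY)Y,~(@C)Z"
)


def build_table(spec):
    """Map each second letter to the set of first letters it may follow."""
    table = {}
    for item in spec.split(","):
        second = item[-1].lower()
        firsts = item[:-1]
        negate = firsts.startswith("~")
        firsts = firsts.lstrip("~").strip("()").replace("@", VOWELS)
        letters = set(firsts.lower())
        if negate:
            letters = set(string.ascii_lowercase) - letters
        table[second] = letters
    return table


TABLE = build_table(DIGRAMS)


def breakable(first, second):
    """True if a hyphen may be placed between the two letters."""
    return first in TABLE.get(second, set())


def breakable_digrams(word):
    """List every adjacent letter pair of `word` that is breakable."""
    w = word.lower()
    return [w[i:i + 2] for i in range(len(w) - 1) if breakable(w[i], w[i + 1])]
```

Under this table 'winning' has the single breakable pair 'nn',
while 'hyphenate' has none, in agreement with the text.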


Finally, we insist on at least one vowel before and after the 
break. This is so that we do not hyphenate words like 'bless'
which only appear to have a hyphenating suffix, or words like
'returns' which would otherwise be hyphenated 'retur-ns'. Also
we do not hyphenate words with strange characters in them 
other than certain leading and trailing punctuation and an 
initial capital. Otherwise, paragraphs like this and the last 
two might prove awkward to decipher. 


*  HYPHENATE(RWORD,MIN) will indicate where within the
*  reversed word (RWORD) a hyphenation point can be found. MIN
*  indicates the number of characters by which the word must
*  be diminished in order that the line may include this
*  word. A global variable, HYPHEN, will be set to '-' if a
*  hyphen must be added to the word. HYPHENATE will fail if
*  no hyphenation point is found. As an example,
*  HYPHENATE('niatbo',3) will just succeed and return a value
*  of 4. HYPHEN will be set to '-'. The 2nd argument may be
*  < 0 in which case the first nontrivial hyphenation will be
*  found.

         DEFINE('HYPHENATE(RWORD,MIN)K,C,L')

*  Initialize suffix matching patterns. Construct 3 patterns
*  INHIB_SUFF, NEUT_SUFF, and HYPH_SUFF corresponding to the
*  3 types of suffixes mentioned in the text. They will be
*  applied to a reversed version of the word to be
*  hyphenated.
         INHIB_SUFF = OR(UPLO(BALREV('ED,(GLSV)E,(GQ)UE,ING,EST,')))
         NEUT_SUFF = OR(UPLO(BALREV('(AI)BLE,LY,S,ES,')))
+           | ANY('.;,:!?)')
         HYPH_SUFF = OR(UPLO(BALREV(
+           'TURE,(CGST)IVE,(CDMNT)IAL,FUL,(CGST)IAN,'
+           '(CGST)ION,SHIP,(LN)ESS,(CGST)IOUS,(CDGLMNTV)ENT,')))


*  DIGRAMS is a string representing all letter pairs which
*  are regarded as breakable. Thus 'XA' is a breakable pair.
*  '@' stands for the set of vowels (aeiou) and '¬' stands
*  for complementation. Hence '¬(@)B' means that all
*  consonants followed by a 'b' are breakable; also '¬(@NS)C'
*  means that any vowel, 's' or 'n', when followed by a 'c'
*  is NOT breakable.
         DIGRAMS =
+           'XA,¬(@)B,¬(@NS)C,¬(@R)D,XE,¬(@)F,¬(@N)G,¬(@CGPSTW)H,XI,'
+           '¬(@)J,¬(@CLNS)K,¬(@BCFGPTY)L,¬(@Y)M,¬(@GKSY)N,(AX)O,'
+           '¬(@SY)P,¬(S)Q,(JKLMNRSVXZ)R,¬(@KLNWY)S,¬(@FHSY)T,XU,'
+           '¬(@)V,¬(@S)W,¬(@)X,(QWXY)Y,¬(@C)Z'

*  Convert @ to vowels, and find complement if ¬ is present.
HYPH_D1  DIGRAMS '@' = 'AEIOU'                   :S(HYPH_D1)
HYPH_D2  DIGRAMS '¬' BAL . T = '(' DIFF(UPPERS_,T) ')'
+                                                :S(HYPH_D2)


*  Convert to lower case and reverse to make scanning easier.
*  Then prepare a table (DIGRAM_TBL) of all those breakable
*  digrams.
         DIGRAMS = BALREV(UPLO(DIGRAMS))
         DIGRAM_TBL = TABLE(30)
HYPH_D3  DIGRAMS LEN(1) . C
+           ('(' BREAK(')') . CC ')' | LEN(1) . CC)
+           (',' | RPOS(0)) =                    :F(HYPH_D4)
         DIGRAM_TBL<C> = ANY(CC)                 :(HYPH_D3)
HYPH_D4


Ne ee em ee ge ee ee et eee ey oe et ee a 
{| HYPH_PAT is the chief hyphenating pattern combining all | 
{ previous patterns into one. It will look for a break at | 
{ least MIN spaces from the back of the string and will set | 
{ K to equal the break point. ( 
Ma asetnesnnt-esie> snappers eis snh ests ts hss Fs ets tl r-hhesssS unruaeresenassosaell 
HYPH_PAT = HYPH_SUFF @K (*GT(K,MIN) { FENCE *HYPH_ PAT) 
+ | NEUT_SUFF FENCE *HYPH_PAT 
+ { (INHIB_SUFF | NULL) FENCE ARB LEN(1) $C @K 
+ , *GT (K,MIN) *DIGRAM_TBL<C> 


*  Other miscellaneous patterns follow.
         TRUE_WORD = POS(0) (ANY('.;),:!?') | NULL)
+           SPAN(LOWERS_ '-') (ANY(UPPERS_ '(') | NULL) RPOS(0)
         FIRST_VOWEL = BREAK(UPLO('AEIOU')) LEN(1) @L
         FOLLOWING_VOWEL = POS(0) TAB(*K) BREAK(UPLO('AEIOUY'))
                                                 :(HYPHENATE_END)

*  Entry point: Check to see if a normal word is there. Set
*  MIN to be at least beyond the first vowel.
HYPHENATE
         RWORD TRUE_WORD                         :F(FRETURN)
         RWORD '-'                               :S(HYPH_1)
         RWORD FIRST_VOWEL                       :F(FRETURN)
         MIN = LT(MIN,L) L


*  Scan for a hyphenation point; check for following vowels.
*  Insist on more than one character preceding the
*  hyphenation point.
         RWORD HYPH_PAT                          :F(FRETURN)
         RWORD FOLLOWING_VOWEL                   :F(FRETURN)
         LE(SIZE(RWORD) - K,1)                   :S(FRETURN)

*  Return K and set HYPHEN to a '-'.
         HYPHENATE = K
         HYPHEN = '-'                            :(RETURN)

*  If the word already contains a hyphen, this is the only
*  point at which it may be hyphenated.
HYPH_1   HYPHEN =
         RWORD '-' @K *GT(K,MIN)                 :F(FRETURN)
         HYPHENATE = K - 1                       :(RETURN)
HYPHENATE_END
Names referenced    Name       Type       Where defined
by HYPHENATE:       BALREV *   Function   Program 3.8
                    OR *       Function   Program 8.9
                    UPLO *     Function   Program 2.1
                    DIFF *     Function   Program 3.10
                    UPPERS_ *  String     Program 2.1


* indicates name is referenced in the initialization section. 


Epilogue 


The coding of HYPHENATE was based on the desire to make it 
easy to see and modify the suffixes and letter pairs on which 
the algorithm is built, but at the same time to produce an ef- 
ficient subroutine. The suffixes and digrams have therefore 
been transformed by the initialization section from a viewable 
format to a swiftly runnable one. The result of the
precomputing is a single pattern (HYPH_PAT) used to scan the
word in reverse until either a hyphenation point is found, in
which case the variable K is set, or none is found, in which
case the pattern fails. Suffix testing and removal are done by
essentially ORing the various suffixes together with an
appropriate degree of sophistication as contributed by the
function OR (Program 8.9). OR contributes to efficiency by
consolidating strings beginning with the same first character.


Digrams are done a little differently. One could have taken
the OR of all breakable digrams to produce a pattern of the
form


     'a' ANY(...) | 'b' ANY(...) | 'c' ANY(...) | ...


This would require 26 tests for each character within the word
to be hyphenated until a break point was found. A more direct
approach is a variant on the pattern


     LEN(1) $ C *DIGRAM_TBL<C>


where the search through 26 alternates is replaced by the
lookup in the table. Since the look-up is done by hash coding
it can be, and is, accomplished faster than ORing.


But it is interesting to note that it is not a great deal 
faster. Evaluating an unevaluated expression requires suf- 
ficient time that the tradeoff in speed occurs at about 10 
alternands. If the pattern were intelligent enough not to take 
alternatives after once finding a character it would avoid 
some needless testing and the average number of trials would 
be 13, not 26. Moreover, if the sequence of characters is ar- 
ranged in order of the frequency of their appearance in 
English, we may expect to wait on the average of perhaps only 
6 alternands. This suggests a pattern of the form


     'e' FENCE ANY(...) | 't' FENCE ANY(...) | ...


This pattern is slightly more awkward to use since it will 
succeed or fail at the first character position. It must be 
moved against the subject string by explicit programmer com- 
mands. Since the speedup of this approach cannot be great (if 
even positive) we leave its encoding as an exercise. 
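
The effect of ordering the alternatives can be estimated with a
few lines of Python. This is only a counting sketch, not the
SNOBOL4 pattern itself: it assumes that testing stops as soon as
the letter is found (the FENCE effect) and takes the frequency
order from Exercise 10.13. The function names are hypothetical.

```python
ALPHA_ORDER = "abcdefghijklmnopqrstuvwxyz"
FREQ_ORDER = "etoanirshdlcwumfygpbvkxqjz"   # letters by frequency in English


def alternands_tried(letter, order):
    """Alternatives examined before `letter` is found, stopping there."""
    return order.index(letter) + 1


def average_trials(text, order):
    """Mean number of alternatives examined per letter of `text`."""
    letters = [ch for ch in text.lower() if ch in order]
    return sum(alternands_tried(ch, order) for ch in letters) / len(letters)
```

Over a text in which every letter is equally likely, either
ordering averages 13.5 trials; the gain from FREQ_ORDER comes
entirely from the skewed letter distribution of real English.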


Program 10.8  IMAGE


Printing a line which contains backspace characters is not
easy using a standard line printer. In fact, it is not
immediately clear how we can even package this activity.


We certainly would like to focus all print line extraction in- 
to a single function. But what is this function to return? 
If the function were to go ahead and print the line, complete 
with overstrikes, we would not have a very flexible function. 
Since we have no idea of the use that is to be made of the 
line it would be rather poor practice to commit ourselves in 
advance to any particular disposition. We could return a 
linked list of lines, one for each overstrike or a string of 
consecutive lines (assuming we know the line width these could 
be later separated) but these 2 methods imply the necessity of 
disentangling the strings once they were brought back, a 
process easily enough done but just as soon avoided if 
possible. Rather than return all the lines at once we will 
have IMAGE return just one particular line, the line numbered 
I. This will help us in 2 ways. Not only will it be easier
to use in the normal case, but it will provide us with random
access to certain levels of lines. If, for example, we inter- 
pret the 3rd overstrike as actually a superscript, we could 
print that line first before going on to the others. 


IMAGE(S,I) will return the Ith overstruck image of the B- 
normalized string S; for I=1 the line proper is returned, for 
I=2, the set of first overstrikes is returned, for I=3, the 
set of 2nd overstrikes, etc. For I=0 the underscoring of sec- 
tions set off by USCORE's is returned. If IMAGE(S,I) does not 
exist for some I, the function will fail. Note that for I=1 
the function never fails. 


For example, let 


S = "THE WQUICK BRO<«/WNH FO</xX'! 


then 
IMAGE(S,0) = °¢ a _ * 
IMAGE(S,1) = 'THE QUICK BROWN FOX'
IMAGE(S,2) = °* / yd 
IMAGE (S, 3) fails 
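
The behaviour of IMAGE can be modelled in a few lines of Python.
This is a sketch under stated assumptions, not a transcription
of the program: BSPACE is taken to be the ASCII backspace, USCORE
is a hypothetical control character that toggles underscoring, a
B-normalized string is assumed to be a sequence of print
positions each optionally followed by BSPACE-character pairs,
and failure is modelled by returning None.

```python
BSPACE = "\b"      # backspace
USCORE = "\x1f"    # hypothetical marker that switches underscoring on/off


def columns(s):
    """Split a B-normalized string into print columns.

    Each column is (chars, underlined): chars[0] is the character
    proper, chars[1:] are its overstrikes in order.
    """
    cols = []
    underlined = False
    i = 0
    while i < len(s):
        ch = s[i]
        if ch == USCORE:
            underlined = not underlined
            i += 1
        elif ch == BSPACE:
            cols[-1][0].append(s[i + 1])   # overstrike on the previous column
            i += 2
        else:
            cols.append(([ch], underlined))
            i += 1
    return cols


def image(s, i):
    """Return the i-th print line of s, or None where IMAGE would fail."""
    cols = columns(s)
    if i == 0:
        if USCORE not in s:
            return None
        return "".join("_" if under else " " for _, under in cols)
    if i == 1:
        return "".join(chars[0] for chars, _ in cols)
    if not any(len(chars) >= i for chars, _ in cols):
        return None
    return "".join(chars[i - 1] if len(chars) >= i else " " for chars, _ in cols)


def overprint_layers(s):
    """Lines to print on top of one another: line 1, overstrikes, underscoring."""
    layers = [image(s, 1)]
    level = 2
    layer = image(s, level)
    while layer is not None:
        layers.append(layer)
        level += 1
        layer = image(s, level)
    underscoring = image(s, 0)
    if underscoring is not None:
        layers.append(underscoring)
    return layers
```

With QUICK set off by USCORE's and a '/' struck over each 'O',
the sketch yields three layers: the line proper, one line of
slashes and one line of underscores.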
Printing a line reduces to the following program. First we
associate OVER with a format which ensures overstriking.
(PRINTER is a variable designating the printer unit; it is
installation dependent and must be given by the user.) The
width of the printer is assumed to be 132.


         OUTPUT(.OVER,PRINTER,'(1H+,132A1)')

         OUTPUT = IMAGE(LINE,1)
         I = 1
LOOP     I = I + 1
         OVER = IMAGE(LINE,I)                    :S(LOOP)
         OVER = IMAGE(LINE,0)


Note that nothing is printed in a statement in which IMAGE
fails.


Even this activity, however simple and straightforward, could be
avoided if we had the ability to return a data object having
more dimensions than the singly dimensioned string. Such data
objects exist; for example an extended version of SNOBOL4, 
called SNOBOL4B [Gimpel 1972], has a 3-dimensional aggregate 
of characters as a special datatype (called a block). The 
system which produced this text was written in SNOBOL4B. In 
this system not only does a function return an overstruck line 
as a value but there exists a function called TYPSET which 
returns an entire paragraph complete with overstriking. 


*  IMAGE(S,I) will return the Ith print line associated with
*  the string S. It will fail if there is no Ith line. S is
*  assumed to be B-normalized.
         DEFINE('IMAGE(S,I)C,BU,T,T1')
         IF_OVERSTRIKE = BREAK(BSPACE USCORE)
         IF_BSPACE = BREAK(BSPACE)
         IF_USCORE = BREAK(USCORE)
                                                 :(IMAGE_END)


*  Entry point: Fan out to various locations depending on
*  value of I.
IMAGE    LE(I,0)                                 :S(IMAGE_USCORE)
         GT(I,1)                                 :S(IMAGE_BSPACE)

*  I = 1: Ignore USCORE's, BSPACE's and characters following
*  BSPACE's.
         IMAGE = S
         IMAGE IF_OVERSTRIKE                     :F(RETURN)
IMAGE_1  IMAGE BREAK(BSPACE USCORE) . T
+           (USCORE | LEN(2)) = T                :S(IMAGE_1)F(RETURN)


*  For line 0 come here. Make fast scan for USCORE failing
*  if none exists. BU will be a convenient abbreviation for
*  BREAK(USCORE). Replace all up to the first USCORE by
*  blank. Replace material between USCORE's by '_'s.
IMAGE_USCORE
         S IF_USCORE                             :F(FRETURN)
         BU = BREAK(USCORE)
IMAGE_UL
         S BU . T USCORE (BU . T1 USCORE | REM . T1) =
         IMAGE = IMAGE DUPL(' ',SPACING(T))
+           DUPL('_',SPACING(T1))
         S BU                                    :S(IMAGE_UL)
         IMAGE = IMAGE DUPL(' ',SPACING(S))      :(RETURN)

*  For I > 1 come here. Set up pattern PAT.C specially
*  computed for level I.
IMAGE_BSPACE S IF_BSPACE                         :F(FRETURN)
         PAT.C = BSPACE LEN(1) . C
IMAGE_B1 I = I - 1 ; GE(I,2)                     :F(IMAGE_B2)
         PAT.C = BSPACE LEN(1) PAT.C             :(IMAGE_B1)


*  See if an Ith overstruck character exists. Set it to C if
*  it does.
IMAGE_B2 S POS(0) BREAKX(BSPACE) . T PAT.C =
+                                                :F(IMAGE_B3)
         IMAGE = IMAGE DUPL(' ',SPACING(T) - 1) C

*  Now remove any remaining BSPACE's. If the right neighbor
*  does not exist we are free to return.
         S POS(0) ARBNO(BSPACE LEN(1)) NOTANY(BSPACE) . C = C
+                                                :S(IMAGE_B2)F(RETURN)


*  The clue to whether any characters at level I exist is
*  found in IMAGE. If it is still null no Ith level
*  characters have been found.
IMAGE_B3 IDENT(IMAGE,NULL)                       :S(FRETURN)
         IMAGE = IMAGE DUPL(' ',SPACING(S))      :(RETURN)
IMAGE_END
Names referenced    Name       Type       Where defined
by IMAGE:           BSPACE *   Character
                    USCORE *   Character
                    SPACING    Function   Program 10.5
                    BREAKX     Function   Program 8.2


* indicates name is referenced in the initialization section. 


EXERCISES


Exercise 10.1   Modify BNORM so that it fails if a
B-normalized version of the string does not exist.


Exercise 10.2   Prove that if S1 and S2 are B-normalized
then the concatenation S1 S2 is B-normalized.


Exercise 10.3   The text says that in order to have an
inversion in the print position numbers we must have at least
one double BSPACE. Intuitively this is obvious. Can you prove
it?


Exercise 10.4   Prove that step (ii) of the BNORM algorithm
(Prog. 10.1) preserves the property of being right-balanced.


Exercise 10.5   Suppose string S1 prints the image I1 and
string S2 prints the image I2. Write a pattern-matching
statement to determine whether the image I1 is a subimage of
I2.


Exercise 10.6   Modify INORM to process separately the case
of a single overstrike.


Exercise 10.7   Rewrite PR_POS (in INORM, Prog. 10.2) to use
BREAK rather than ARB to find a BSPACE. Assume the string to
be matched is B-normalized.


Exercise 10.8   (a) How would the definition of
distinguishable change if overstrikes of the same character
are not regarded as different?

(b) How would the definition change if all nonprintable
characters were regarded as blank? Assume the nonprintables
including blank are contained in the string NONP. Also do not
make the assumption in (a).

(c) How would INORM be modified in each instance?


Exercise 10.9   (a) Modify LINE so that the cost (UF) of
compressing a line is two per char, while the cost of adding
a blank and of hyphenating remains at 1 (requires modifying
one statement). (b) Modify LINE so that the cost (per char)
of compressing a line is UF_C, the cost of padding is UF_P
and the cost of hyphenating is UF_H.


Exercise 10.10   Modify PAD (Prog. 10.4) and MINP (Prog.
10.6) so that any blank following a special character can be
squeezed out. An example of a set of special characters is
',)3:(;'.


Exercise 10.11   What is the value of HYPHENATE(RWORD,K)
for K = 2, 4, 6, 8 where

     (a) RWORD = REVERSE('investment')

     (b) RWORD = REVERSE('co-operation')


Exercise 10.12   Modify HYPHENATE so that it will use not
only '-' as a break character but any of a set of characters
in the string BRC. Slash (/), for example, might be such a
character to be broken in phrases such as 'input/output'.


Exercise 10.13   Modify the hyphenation algorithm so that
digrams are tested in the order of the frequency of letters
in English ('etoanirshdlcwumfygpbvkxqjz') and such that
testing at a particular position ceases when the letter is
found.


Exercise 10.14   Modify HYPHENATE so that any word
consisting entirely of upper case letters will also be
hyphenated.


Exercise 10.15   (a) Write a function PRIMAGE(S) which will
print the image of the B-normalized string S. (b) Given 2
strings, S1 and S2, use PRIMAGE to print them on the same
line with S1 beginning in column 10 and S2 beginning in
column 60 (assume the spacing of S1 is less than 50).


Exercise 10.16   Using PRIMAGE() of the above exercise,
print the B-normalized strings S1 and S2 on the same line.
That is, overstrike one on the other.


Exercise 10.17   Playboy magazine, for reasons best known
to itself, wishes the lead page of the Playboy pictorial to
be laid out in a 'coke bottle' shape. Assume the line widths,
ranging from a maximum of 36 to a minimum of 22, are contained
in a string (LENGTHS) separated by commas. Assume the lead
paragraph is in a variable P. Assume a page width of 60 with
the column centered in the page. Using the function PRIMAGE
from Exercise 10.15 write the SNOBOL4 program to satisfy
Playboy's request.


Exercise 10.18   Suppose that the 3rd overstrike represents
superscripting and the 2nd overstrike represents subscripting
so that

‘A +1 = 2 + <N! 


prints as 


Using IMAGE, print such an object. 


Exercise 10.19   Print a string with exponentiation such as


     'A**(M+1) = B**N + C**M'


in such a way that parentheses (if any) are stripped from the
exponential and the exponents are superscripted such as


Assume that the string contains no BSPACE's and whenever '**'
appears it means superscript the following character unless a
'(' appears in which case the parenthetical expression is
superscripted. Assume that the superscript does not itself
have superscripting. (Hint: this can be done in four
statements using IMAGE and BNORM).


Exercise 10.20   Extend the previous exercise to handle
arbitrarily nested exponentiation.


One of the reasons for writing in a higher level language is
to free oneself from the entanglements of individual bits and
the sometimes sordid details of the particular machine on
which one is running. A price is normally paid for this in
terms of time and/or space efficiency of the resulting program
but one is presumably willing to pay this price if the savings
in programming time are compensative. Then why, the reader may
ask, should we bother about timing and implementation since
the former we have agreed is relatively unimportant and the
latter represents detail from which we wish to escape? The
answer is that although most programs are small and can (and
should) be written without regard for the time they consume,
most large programs come to grips with the efficiency question
sooner or later. Large programs may exceed critical storage
bounds or they may consume so much time that their utility is
in question. Some knowledge of timing is useful not only to
improve the speed of an existing program but to estimate the
cost of running programs not yet written. It may well be that
a program written in SNOBOL4 will be too slow or inefficient
for a given application and it will be helpful to learn this
before it is written.


Describing a system as large as an implementation of the
SNOBOL4 language can be neither easy nor quick. To make
matters even more difficult there are several SNOBOL4 processors.
There is the original MAcro Implementation of SNOBOL4
[Griswold 1972] which we refer to as MAINBOL, there is a
compiler version for the IBM 360/370 called SPITBOL [Dewar 1971]
and a small fast interpreter for the PDP-10 called SITBOL
[Gimpel 1972, 1973a]. In addition, the macros of MAINBOL have
been expanded to run on several different machines including
the IBM 360/370, CDC 6000, Honeywell 635, Univac 1108 and the
PDP-10. The process of macro expansion for yet newer machines
continues at this writing with unabated fervor so that this
list is not, and is not intended to be, exhaustive.


The primary purpose behind SPITBOL was speed and the resulting 
system is 7-8 times faster than MAINBOL. SITBOL's chief 
concern was storage and the system is less than one-third the 
size of MAINBOL. In spite of the differences in design goals, 
the implementations of these systems are fairly similar. 


Symbol Tables


A symbol table is programmer jargon for a table of
information that can be referenced on a name basis (the
symbol). For example, a telephone directory can be regarded
as a symbol table of sorts where the symbol is a person's
name and the information to be looked up is his telephone
number (and possibly other information such as his address).
In principle, a symbol table could be implemented as a long
list and a search could be made by comparing a given symbol
with every one on the list. This is obviously too inefficient
to be practical. In the telephone directory, the symbols are
arranged alphabetically to permit rapid searching. In
general, a symbol table is organized in such a way as to
avoid a lengthy linear search.


A common method of implementing a symbol table is by means of 
a hashing technique, illustrated in Figure 11.1. The Hash 
Array is a fixed-length array of pointers to symbol table 
entries. Each symbol table entry contains the name of the 
symbol (for comparison purposes), information associated with 
the symbol and a pointer to the next symbol table entry (if 
any). Hence, each pointer in the Hash Array may be regarded
as heading a list of symbol table entries.


When a symbol such as ALPHA is looked up or entered into the
table, a so-called hash number is computed from the characters
'ALPHA'; this is a number between 0 and L-1 where L is the
length of the Hash Array. This hash number is used to
reference into the Hash Array and hence it designates a list
of symbol table entries. If a symbol table entry for ALPHA is
in the table, it must be in this list. Thus the time to locate
ALPHA in the table is reduced to roughly 1/L of the time for a
linear search, but is increased by the time needed to compute
a hash number.


The hash number must be reproducible so that given the
characters 'ALPHA' the same hash number is always produced, but the
method for computing the hash is otherwise arbitrary as its 
name would suggest. It should provide a good mix so that all 
locations in the Hash Array (sometimes called buckets) are 
referenced with approximately equal probability. Also the 
computation should be quick. For example, one may take the 
first 4 characters exclusive-OR'ed with the last 4 characters 
and divide by the length L of the array. The remainder is 
usually an acceptable hash number. Note that the hash number 
does not uniquely represent the symbol. In Figure 11.1 both 
ALPHA and GAMMA have the same hash number. 
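
The organization just described can be sketched in Python. The
class and method names are assumptions of this sketch, names are
assumed to be ASCII, and the hash is the example given in the
text: the first 4 characters exclusive-OR'ed with the last 4,
taken modulo the length of the Hash Array.

```python
class Entry:
    """One symbol table entry: name, associated information, chain link."""

    def __init__(self, name, info, next_entry):
        self.name = name
        self.info = info
        self.next = next_entry


class SymbolTable:
    def __init__(self, length=8):
        # The Hash Array: each slot heads a list of entries (a bucket).
        self.hash_array = [None] * length

    def hash_number(self, name):
        """First 4 characters XOR last 4 characters, modulo the array length."""
        data = name.encode("ascii")
        first = int.from_bytes(data[:4], "big")
        last = int.from_bytes(data[-4:], "big")
        return (first ^ last) % len(self.hash_array)

    def lookup(self, name):
        """Return the entry for `name`, or None if it is not in the table."""
        entry = self.hash_array[self.hash_number(name)]
        while entry is not None:
            if entry.name == name:      # compare names along the chain
                return entry
            entry = entry.next
        return None

    def enter(self, name, info=None):
        """Return the existing entry for `name`, creating one if needed."""
        entry = self.lookup(name)
        if entry is None:
            slot = self.hash_number(name)
            entry = Entry(name, info, self.hash_array[slot])
            self.hash_array[slot] = entry
        return entry
```

One weakness of this particular hash is visible at once: for a
name of 4 characters or fewer the two halves coincide, so every
short name lands in bucket 0. A production hash would mix the
characters differently.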


Symbol tables are very important; they form the heart of vir- 
tually every assembler, compiler and interpreter. A symbol 
table provides the link between an external name (symbol) and
an internal block of information about that symbol. One need 
merely reflect on the telephone directory example to see the 
importance of this. Names in a program remain fairly stable 
even though they may translate into different internal ad- 
dresses from run-to-run just as people normally retain their 
names even though they may be associated with different 
telephone numbers over the course of their lifetime. 


For SNOBOL4 implementations, the information typically 
retained in the symbol table entry for, say, ALPHA is the 
value of the natural variable ALPHA, a pointer to function in- 
formation if ALPHA is a function and a pointer to an internal 
code location if ALPHA is a label. Also, if ALPHA is a keyword 
(it is not) information may be present to indicate its value. 


Figure 11.1: A symbol table containing three symbols ALPHA,
BETA, and GAMMA.


For interpreters with the power of SNOBOL4, the symbol table
is especially important; it remains in core during execution
and there are language features which depend on this. For
example, indirect referencing, such as:


     A = 'ABC'
     $A = 17


requires that 'ABC' be looked up in the table so that the
symbol table entry associated with 'ABC' (also called a
variable block) can be plugged. The indirect goto is another
example of where the symbol table is queried at run-time. As
another example:


     OPSYN('ALPHA','SIZE')


results in a copy of the function field of the variable block 
for SIZE into the function field of ALPHA. Conventional 
languages such as PL/I and Fortran do not retain a symbol 
table at run-time and hence cannot provide these capabilities. 
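
A small Python model shows why these features need the table at
run-time. It is a sketch only: the dictionary symbols stands for
the retained symbol table, each variable block is a dictionary
with a value field and a function field, and the helper names
are hypothetical.

```python
symbols = {}   # name -> variable block, kept alive during execution


def block(name):
    """Look up (or create) the variable block for `name`."""
    return symbols.setdefault(name, {"value": "", "function": None})


def assign(name, value):
    """NAME = value"""
    block(name)["value"] = value


def assign_indirect(name, value):
    """$NAME = value: the VALUE of NAME is itself looked up as a name."""
    target = block(name)["value"]
    block(target)["value"] = value


def opsyn(new, old):
    """Copy the function field of `old` into the function field of `new`."""
    block(new)["function"] = block(old)["function"]


block("SIZE")["function"] = len     # stand-in for the built-in SIZE
assign("A", "ABC")                  # A = 'ABC'
assign_indirect("A", 17)            # $A = 17
opsyn("ALPHA", "SIZE")              # OPSYN('ALPHA','SIZE')
```

After these statements the variable ABC holds 17 and ALPHA
behaves like SIZE, both of which required a lookup by name while
the program was running.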


Whereas each of the SNOBOL4 processors retains a symbol table
to house symbols required for an associative lookup, MAINBOL
uses the symbol table for yet another purpose, viz. to store
strings. All data strings are stored as symbol table entries.
A certain economy of concept is thereby achieved at the
expense of significant inefficiencies in string handling. For
example, TRIM(INPUT) in MAINBOL will read a record, hash it
into the symbol table and call TRIM which deletes trailing
blanks and hashes the remainder into the symbol table. All
such hashing is avoided in other processors.


While interpreters generally retain the symbol table, com- 
pilers generally do not. Since it requires a volitional act 
for an interpreter to expel the symbol table and a volitional 
act for a compiler to produce it along with working code, the 
correlation seems to be the result of inertia rather than 
reflecting any essential relationship. In fact, exceptions do 
occur. Some compilers produce a symbol table optionally for 
debugging while some interpreters optionally expel the symbol 
table for efficiency. 


Types of Compilers


A compiler, in the most general sense of the term, will
translate a program written in some language into some
intermediate form which can be executed or interpreted by
some other program. If the intermediate form can be executed
directly, the processor is called a compiler, in the narrow
sense of the term. Otherwise it is called an interpreter.


One of the most important questions that can be asked about an
implementation is the form of intermediate code. Into what
form, for example, will


     ALPHA * BETA + GAMMA


be compiled? Different implementations of the same language
may answer this question in different ways. The layman often 
believes that all SNOBOL interpreters leave the string intact 
to be interpreted anew each time the expression is evaluated. 
This is a kind of interpretation called pure interpretation 
and since the compiler has zero work to do, we will call the 
compiler a type-0 compiler. Some languages are implemented as 
pure interpreters (such as GPM, Program 18.8) but SNOBOL4 is 
not one of them. 


A type-1 compiler will convert indivisible syntactic units 
(called tokens) into pointers into the symbol table. For ex- 
ample, the expression above will be converted into 


     --> ALPHA
     --> *(2)
     --> BETA
     --> +(2)
     --> GAMMA


where --> ALPHA is a pointer to the symbol table entry for
ALPHA, where --> *(2) is a pointer to the symbol table entry
for binary *, etc. LISP [McCarthy, 1960] is an example of a
language which employs a type-1 compiler.


The searching for, and the conversion of, tokens into symbol 
table pointers is called lexical analysis. Most compilers more 
sophisticated than type-1 nevertheless precede other proces- 
sing with a lexical analysis. 
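
A type-1 translation amounts to little more than the following
Python sketch. It assumes tokens are separated by blanks (as
SNOBOL4 requires around binary operators), treats every operator
as binary, and represents a symbol table entry as a dictionary;
the function name is hypothetical.

```python
BINARY_OPERATORS = {"+", "-", "*", "/"}


def lexical_analysis(source, symbol_table):
    """Convert each token into a reference to its symbol table entry."""
    code = []
    for token in source.split():
        # Binary operators are entered under names such as '*(2)'.
        name = token + "(2)" if token in BINARY_OPERATORS else token
        entry = symbol_table.setdefault(name, {"name": name})
        code.append(entry)
    return code
```

Because every occurrence of a name yields a reference to the
same entry, later phases never need to compare character strings
again.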


A type-2 compiler will rearrange the pointers into a form more 
suitable for execution. This can either be a Polish prefix 
representation in which the functions precede the arguments or 
a Polish suffix representation in which the function pointers 
follow the arguments. Each form is illustrated in Figure 11.2. 
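
Both Polish forms can be derived from the same parse, as the
following Python sketch suggests. It handles only binary,
left-associative operators with the usual precedence of * over
+, works on plain token strings rather than symbol table
pointers, and uses hypothetical function names.

```python
PRECEDENCE = {"+": 1, "*": 2}


def parse(tokens):
    """Build a nested tuple (op, left, right) from a list of infix tokens."""
    def expr(pos, min_prec):
        left = tokens[pos]
        pos += 1
        while pos < len(tokens) and PRECEDENCE[tokens[pos]] >= min_prec:
            op = tokens[pos]
            right, pos = expr(pos + 1, PRECEDENCE[op] + 1)
            left = (op, left, right)
        return left, pos

    tree, _ = expr(0, 1)
    return tree


def prefix(tree):
    """Polish prefix: the function precedes its arguments."""
    if isinstance(tree, str):
        return [tree]
    op, left, right = tree
    return [op] + prefix(left) + prefix(right)


def suffix(tree):
    """Polish suffix: the function follows its arguments."""
    if isinstance(tree, str):
        return [tree]
    op, left, right = tree
    return suffix(left) + suffix(right) + [op]
```

For ALPHA * BETA + GAMMA the prefix form is + * ALPHA BETA GAMMA
and the suffix form is ALPHA BETA * GAMMA +.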


Figure 11.2  The result of a type-2 compilation of the
expression ALPHA * BETA + GAMMA may be (a) Polish prefix
or (b) Polish suffix.


Most interpreters operate on type-2 code. In particular,
MAINBOL uses Polish prefix and SITBOL uses Polish suffix.
Polish prefix is slower but more flexible than Polish suffix.
It is slower because with prefix code the function is
encountered first. When the function gets control it calls the
interpreter to obtain its arguments. This call is necessarily
recursive and hence slow. In Polish suffix the function is
called after the arguments have been evaluated; there is no
need for recursion. But Polish prefix is more flexible because
certain operators can decide that they do not want to play the
same game as other operators. Unary *, for example, does not
evaluate its argument but merely returns a pointer to it to be
evaluated at some later time. In Polish suffix, unary * can't
decide this on its own but needs the co-operation of the
compiler. This leads to other problems. For example, unary *
cannot be redefined at run-time.
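The difference between the two Polish forms can be sketched in a few lines. The following is only an illustration in Python, not the code of MAINBOL or SITBOL; the token lists PREFIX and SUFFIX, the tables VALUES and BINARY, and the sample values are assumptions made for the example.

```python
import operator

# Hypothetical symbol table: variables map to values, operators to functions.
VALUES = {"ALPHA": 2, "BETA": 3, "GAMMA": 4}
BINARY = {"*": operator.mul, "+": operator.add}

# ALPHA * BETA + GAMMA in the two type-2 forms (assumed layouts).
PREFIX = ["+", "*", "ALPHA", "BETA", "GAMMA"]   # functions precede arguments
SUFFIX = ["ALPHA", "BETA", "*", "GAMMA", "+"]   # functions follow arguments


def eval_prefix(code, pos=0):
    """Return (value, next position). Each operator re-enters the
    evaluator to fetch its own arguments, hence the recursion."""
    token = code[pos]
    if token in BINARY:
        left, pos = eval_prefix(code, pos + 1)
        right, pos = eval_prefix(code, pos)
        return BINARY[token](left, right), pos
    return VALUES[token], pos + 1


def eval_suffix(code):
    """Arguments are already on the stack when an operator is met,
    so a simple loop suffices and no recursion is needed."""
    stack = []
    for token in code:
        if token in BINARY:
            right = stack.pop()
            left = stack.pop()
            stack.append(BINARY[token](left, right))
        else:
            stack.append(VALUES[token])
    return stack.pop()
```

In the prefix evaluator an operator such as unary * could simply decline to call eval_prefix on its argument; in the suffix evaluator the argument has already been evaluated by the time the operator is seen, which is why the compiler must co-operate.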


The types 0-2 compilers are regarded as interpreters because
the output (intermediate code) is not capable of being
executed directly by machine. A type-3 compiler will produce
code which can actually be executed. The above expression
becomes:

          PUSH  --> ALPHA
          PUSH  --> BETA
          CALL  --> *(2)
          PUSH  --> GAMMA
          CALL  --> +(2)


where each function finds its arguments on the stack and 
replaces them with the result of its computation. For ef- 
ficiency purposes, registers can be used instead of the stack 
except for very deeply nested expressions. 


A type-4 compiler is one which produces optimal (or
near-optimal) machine code. The above expression is reduced to:


          LOAD  --> ALPHA
          MULT  --> BETA
          ADD   --> GAMMA


Most true compilers are combinations of type-3 and type-4. For
example, Fortran I/O routines and trigonometric functions are
handled with type-3 calls whereas infix operators (+ * - /)
and some arithmetic functions such as MAX and ABS are executed
in-line in a type-4 manner. SPITBOL is almost entirely type-3.
The only operation it does in-line is assignment. In-line
addition, for example, can't be done because variables are
typeless and the compiler has no way of knowing whether A + B
is floating point addition, fixed point or mixed mode.
Assignment, on the other hand, even for strings and arrays, is
comparatively simple since only a pointer and a datatype need
be copied.
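To make the point about typeless variables concrete, here is a small Python sketch of what an out-of-line addition routine has to decide at run time. The function name snobol_add and its conversion rules are assumptions for illustration only; they are not taken from any SNOBOL4 processor.

```python
def snobol_add(a, b):
    """Add two typeless operands, choosing the arithmetic at run time."""

    def to_number(x):
        # Integers and reals pass through unchanged.
        if isinstance(x, (int, float)):
            return x
        # Strings must be converted first; the null string counts as zero.
        if x == "":
            return 0
        return float(x) if "." in x else int(x)

    left, right = to_number(a), to_number(b)
    # Only now is it known whether this is fixed point, floating
    # point or mixed mode addition.
    return left + right
```

A compiler that sees only the text A + B cannot select among these cases ahead of time, which is why the operation is packaged as a call.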


It should be evident that as the sophistication of the com- 
piler increases (increasing type numbers) the speed of com- 
pilation decreases, the speed of execution increases and the 
flexibility of the run-time system decreases. For example, 
the type-2 rearrangement of operators is done so that 
operators will be where they are needed when it comes time to 
execute. This is faster but less flexible since it means that 
it is practically impossible to change the precedence of 
operators at run-time in a type-2 system; an irrevocable deci- 
sion is made at compile-time. 


Floating Storage


The lack of declarations in SNOBOL4 (e.g., S is a string whose
maximum length is 1000) implies that storage is not
preallocated for variables but rather is allocated on demand.
When storage is no longer in use it is freed automatically by
a so-called garbage collection process.


In SPITBOL, SITBOL and MAINBOL the storage allocation scheme 
is basically the same. Allocating storage is ultra-simple. 
When a chunk of storage is needed it is taken from the begin- 
ning of a free region and the pointer to the free region is 
updated. When no free storage is left, the garbage collector 
is called. The first step of collection is a marking process 
in which all accessible blocks are marked as such. This is 
similar in spirit to the function VISIT (Prog. 5.10) and in 
SITBOL and SPITBOL it is actually implemented in the same way. 
Once the accessible blocks have been identified, they are 
moved together so that further allocations can be performed. 
Before the movement, any pointer pointing into or to a 
floating block must be adjusted. The term floating is used as 
it seems to correctly connote the relative ease by which the 
blocks may be moved about. The incorrect care and feeding of 
floating addresses while implementing a system such as SNOBOL4 
has led to many an implementation disaster. A useful rule of 
thumb is that one such error will lead to a day's worth of 
debugging sometime in the future. 
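The allocation scheme just described (take storage from the front of the free region, and on exhaustion mark, adjust pointers, and slide the live blocks together) can be sketched as follows. This is a toy model in Python under simplifying assumptions: every block occupies one slot, an "address" is a list index, and the class Heap with its methods assign and point is hypothetical, not the layout used by any of the three processors.

```python
class Heap:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = []   # block i lives at address i; free region follows
        self.roots = {}    # variable name -> address of its block

    def assign(self, name):
        """Allocate a fresh block and bind the variable `name` to it."""
        if len(self.blocks) == self.capacity:
            self.collect()
            if len(self.blocks) == self.capacity:
                raise MemoryError("no free storage even after collection")
        self.blocks.append({"refs": [], "marked": False})
        self.roots[name] = len(self.blocks) - 1

    def point(self, source, target):
        """Make the block of `source` point to the block of `target`."""
        self.blocks[self.roots[source]]["refs"].append(self.roots[target])

    def collect(self):
        # Marking: everything reachable from a variable is accessible.
        pending = list(self.roots.values())
        while pending:
            block = self.blocks[pending.pop()]
            if not block["marked"]:
                block["marked"] = True
                pending.extend(block["refs"])
        # Decide where each accessible block will end up.
        new_address = {}
        for old, block in enumerate(self.blocks):
            if block["marked"]:
                new_address[old] = len(new_address)
        # Adjust every floating address before the blocks move.
        self.roots = {n: new_address[a] for n, a in self.roots.items()}
        for block in self.blocks:
            if block["marked"]:
                block["refs"] = [new_address[r] for r in block["refs"]]
        # Move the accessible blocks together and clear the marks.
        self.blocks = [b for b in self.blocks if b["marked"]]
        for block in self.blocks:
            block["marked"] = False
```

Forgetting the pointer adjustment step in collect is exactly the kind of floating-address error warned about above: the program keeps running for a while with stale addresses and fails much later.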


It is interesting to note that the predecessor to SNOBOL4, viz.
SNOBOL3, implemented its marking phase by means of a
use-count. Every time a variable's value is changed under such
a system, the use-count on the new object would be augmented
and the use-count on the old would be decremented. Marking
consists of looking for nonzero use-counts. Where strings are
the only datatype, as in SNOBOL3, this is not a bad scheme.
If one can have structures pointing to other structures,
however, the scheme suffers from the prospect that two
structures pointing to each other may be inaccessible from the
rest of the world and yet have nonzero use-counts.
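The weakness of use-counts is easy to reproduce. The following Python sketch simulates the bookkeeping by hand; the class Node and the helper set_link are hypothetical names for the illustration and do not reflect SNOBOL3's internal layout.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.use_count = 0
        self.link = None


def set_link(node, target):
    """Point node.link at target, keeping the use-counts up to date."""
    if target is not None:
        target.use_count += 1        # augment the count on the new object
    if node.link is not None:
        node.link.use_count -= 1     # decrement the count on the old one
    node.link = target


a, b = Node("A"), Node("B")
a.use_count += 1      # some program variable refers to A
b.use_count += 1      # ... and another refers to B
set_link(a, b)        # A points to B
set_link(b, a)        # B points to A

# The two program variables are now given other values.
a.use_count -= 1
b.use_count -= 1

# Nothing in the program can reach A or B, yet neither count is zero.
unreachable_but_counted = [n.name for n in (a, b) if n.use_count > 0]
```

A marking phase that starts from the program's variables, as in the scheme described earlier, does not have this problem because it never visits the isolated pair.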


The method of implementing the garbage collector in SPITBOL 
and later copied over into SITBOL was especially clever. After 
visiting nodes in the manner of the function VISIT, the poin- 
ters are left in their reverse direction. This leads to a fast 
pointer adjustment phase as all the floating addresses which 
had been pointing to a floating block are then hung off the 
block in a linked list. The MAINBOL processor uses a more 
conventional marking phase using recursion much in the manner 
of COPYL (Prog. 5.8). Also the use of macros produced a slower 
system. The result is that the garbage collectors of SPITBOL
and SITBOL are much faster than that of MAINBOL.


Anatomy of a Processor


This section attempts to describe how a SNOBOL4 processor is
organized and which parts of it are exercised most frequently
during the course of executing a program. While such an
analysis is application and implementation dependent, certain
valid conclusions can nonetheless be drawn concerning the
running of arbitrary programs against such systems.


Most SNOBOL implementations tend to be implemented as one
large assembly program and it is often difficult to break down
the resource utilization into different functional
compartments. The SITBOL implementation is an exception. It consists
of 20 separately-assembled files segregated according to func- 
tion as indicated in Table 11.1. Each section is designated 
with a two or three-letter mnemonic as well as an indication 
of space occupied as a percentage of the whole. The approx- 
imate number of instructions in each section can be computed 
by multiplying the percentage by the total number of words 
(9300). 


The 15.5% figure for I/O in Table 11.1 is surprisingly high. 
It includes code to read and analyze the command string, set
up memory, provide a fairly rich collection of system
facilities, interpret special I/O formats and make suitable
conversions. The space devoted to the interpreter is padded
by calls to produce run statistics at job termination plus a 
message interpreter. Hence the 7.3% figure is larger than what 
would normally be considered strictly necessary for the inter- 
pretation of Polish suffix. Also required in interpretation 
is all that machinery necessary to provide the correct number 
of arguments to functions, to evaluate arguments (convert 
variables such as A to the value of A, or convert INPUT to 
the next string read, etc.), and to interpret goto's and react 
correctly to failure. 


The compiler consists of a lexical analyzer (LEX) which makes
calls on the symbol table manager (SYM) to convert source
tokens to pointers into the symbol table which it feeds back
to the syntactic analyzer (SYN). LEX makes calls on the
streamer (SMR) to search for one of a set of characters. Thus
the entire compiler represents 18.5% of the system with the
syntactic analyzer only 4%. This is surprising in view of the
great attention devoted to syntactic analysis in the
literature. The symbol table manager is bloated by an internal
symbol table of approximately 450 words (4.8%) and a number of
symbol table related functions such as CLEAR() and OPSYN().
The machinery for locating and installing names into the
symbol table is actually quite small.


Table 11.1  The Decomposition of SITBOL. Regions are named by
a short (2 or 3 letter) mnemonic. The Size is based on the
number of words of assembled code and is given as a percentage
of the total. The overall size was 9300 (36-bit) words. The
storage considered is pure storage and does not include space
for stacks, symbol tables, code blocks, etc.

     Name   Size(%)   Description
     ----   -------   -----------
     IO      15.5     I/O and system interface
     INT      7.3     Interpreter
     GC       3.7     Garbage Collector
     SYN      4.1     Syntactic Analyzer
     LEX      4.4     Lexical Analyzer
     SYM      7.9     Symbol table manager
     STR      6.1     String handler
     SMR      2.1     Streaming (character set searching)
     PG       5.7     Patterns Global (pattern building and
                      the scanner)
     PL       7.9     Patterns Local (built-in functions
                      and primitives)
     NUM      2.1     Numeric functions
     CVT      4.4     Datatype conversions (string <==> numeric)
     ARY      2.0     Arrays (allocation & referencing)
     KW       2.0     Keywords
     TBL      2.9     Tables (allocation, referencing and
                      conversion)
     DFF      3.5     Defined functions
     DFD      1.8     Defined Datatypes
     ERR      2.0     Error handling
     TRC  [illegible] Tracing
     DATA     7.1     Assembled in strings, character sets, etc.
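The percentages of Table 11.1 convert to words by simple multiplication, and the compiler total quoted above is just the sum of four rows. A short Python check, using only figures stated in the text (the dictionary and function names are made up for the example):

```python
TOTAL_WORDS = 9300   # overall size of SITBOL in 36-bit words

# Size in percent for a few regions of Table 11.1.
SIZE_PERCENT = {
    "IO": 15.5, "INT": 7.3,
    "LEX": 4.4, "SYN": 4.1, "SYM": 7.9, "SMR": 2.1,
}


def words(region):
    """Approximate number of words occupied by a region."""
    return SIZE_PERCENT[region] * TOTAL_WORDS / 100


# The compiler is LEX + SYN + SYM + SMR.
compiler_percent = sum(SIZE_PERCENT[r] for r in ("LEX", "SYN", "SYM", "SMR"))

# The internal symbol table of about 450 words, as a percentage.
internal_table_percent = 450 / TOTAL_WORDS * 100
```

Here words("IO") comes to about 1440 words, compiler_percent to 18.5, and internal_table_percent to about 4.8, in agreement with the figures quoted.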


The relatively large quantity, 7.9%, of code for PL (Patterns 
Local) is attributable to the relatively large number of 
built-in patterns such as POS(n), BREAK(s), BAL, etc. 


The SITBOL system has a profiling capability which indicates
where the system is spending its time. One can obtain a
user-oriented histogram (via statement numbers) or a
system-oriented one (via absolute addresses). This, coupled
with the physical segregation previously described, makes it
fairly easy to determine the percentage of time devoted to
each subactivity. Table 11.2 summarizes the results of running
the profiler for 6 typical string applications. The last
column indicates a composite figure obtained rather
arbitrarily by averaging the other 6 figures.


Table 11.2  The percentage of time spent in various regions
of SITBOL for a variety of string-processing problems.


| | 
| | 
{ { 
{ ( 
{ Region | L6 Renum TPST Pre Sort Refm | Comp | 
[<See=s=es2 a cal cd at al li acai ee seret= { 
1! ro { 3.0 4.5 2.9 20.7 11.8 | 7.1 =| 
1 INT { 27.0 18.1 30.2 38.8 73.6 33.8 {| 36.9 | 
{ cc 1 40.2 34.8 20.1 20.1 4.2 | #+%19.9 § 
{ SYN { { 1 
{ LEX { | ( 
{ SYM ( 2 2 9 { 1 { 
{ STR { 13.8 26.3 13.3 5.4 9.8 27.1 | 15.9 | 
{ SMR { 1.0 1.6 7.0 1.14 2.3 | 2.3 f 
{ PG | 4.9 7.2 8.1 2-8 Ber 4 4.8 | 
{ PL { 2.3 4.2 7.6 3.1 169° 4 3.1 | 
{. NUM { wit 1.7 1.7 1.1 | 8 |f 
{ CVT { 8 1.8 8 5 2.9 1.3 | 1.5. °4 
{ ARY { 1.4 H 2 | 
{ Kw | | | 
{ ‘TBL | o2 1.0 a2 | 2 | 
{ DFF { 6.0 7.8 4.3 3.2 10.1 Jf 5.2 | 
{ DFD | “2 1.5 4.3 | 1.0 | 
{ ERR | { i 
{ TRC | 3 5 3 o3 2.9 2 | 7 { 
{ DATA 1 | { 


L6 is a compiler. Renum renumbers the statement labels of
Fortran programs. TPST (Typeset) is a program to format
paragraphs and uses functions virtually identical to those
indicated in Chapter 10. Pre is a pre-processor for Fortran
which inserts common areas at the beginning of subprograms and
does minor data massaging. Sort is a linked-list sort of a
kind identical to Prog. 13.3. Refm reads a file with mixed
tabs and blanks separating 4 fields and writes out the file
with columns aligned using tabs as needed. With one exception
(Sort) all programs were complete programs so that time spent
in I/O and other necessary but unrelated activity would be
included in the timing statistics. Not included, as is
evidenced from the data itself, is the time spent compiling.


The composite figure indicates the rather striking fact that
over one-third of the time is spent in the interpreter. Most
of this time would drop to nil if SITBOL had been a compiler.
However a compiler version of SITBOL would almost certainly be
larger by close to the percentage of time saved so that the
cost (measured in core-seconds) would be the same. The
important issue is that the interpretive time is not larger
than it is. Substantial amounts of time are going to other
things such as garbage collection (20%), string processing
(15%), pattern matching (PG, PL and SMR, 10%) and I/O (7%). It
is only in applications such as Sort which use few of the
facilities of the language (no storage allocation, no pattern
matching) that the interpreter time is really excessive. Thus
semantically rich processors such as SNOBOL4 have two reasons
for being written as interpreters. The semantical richness is
easier to write and there is not that much being lost.
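The composite column is nothing more than the arithmetic mean of the six program columns. As a check, the INT row of Table 11.2 can be averaged in Python (the dictionary name and the helper composite are assumptions for the example; the six values are the INT entries of the table):

```python
# Percentage of time spent in the interpreter (INT) for each program.
INT_PERCENT = {
    "L6": 27.0, "Renum": 18.1, "TPST": 30.2,
    "Pre": 38.8, "Sort": 73.6, "Refm": 33.8,
}


def composite(row):
    """Unweighted average of a row of Table 11.2."""
    return sum(row.values()) / len(row)


int_composite = composite(INT_PERCENT)   # a little over one-third
```

The result is about 36.9, the "over one-third" referred to above. Note that the single Sort figure pulls the average up considerably; without it the mean of the remaining five programs is under 30.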


Comparing individual columns it may be seen that the
preprocessor Pre spends relatively large amounts of time doing
I/O because it has virtually no work to do on most lines read.
The relatively low figure of 18% interpreter use in the
Fortran renumbering program is probably due to the heavy use
of concatenation and pattern matching and the rest of the data
bears this out. TPST spends by far more time in SMR than do
the other routines and this is because it is continually
scanning for USCOREs and BSPACEs as was pointed out in
Chapter 10. The PDP-10 has no automatic scan instruction like
the IBM 360 but nonetheless even in this exaggerated use of
the BREAK function, relatively little time (7%) is spent
streaming. The DFF entry indicates the amount of time spent in
function calls and is relatively small even for heavily
recursive applications such as Sort. The amount of time spent
in this category has more to do with the structuredness of the
program. TPST, as a look at Chapter 10 would reveal, is
well-modularized and a certain price must be paid, but the
cost is not excessive. It is somewhat surprising that areas
such as numerics, conversions, tables, arrays,
defined-datatypes, and keywords represent so little of the
total time (3.7%). Even, for example, when the defined
datatypes are used rather heavily as in Sort, the amount of
time spent in DFD is relatively small (4.3%).


How do these figures compare with the corresponding figures
for MAINBOL and SPITBOL? Since SPITBOL is type-3, the time
spent in INT would be reduced substantially and, to a first
approximation, all other activities would experience a
proportional increase (just to make up the 100%). The garbage
collection time would be reduced somewhat because SITBOL,
operating in a time-sharing environment, deliberately keeps a
'low profile' to keep a relatively good priority. This results
in garbage collections every 1500 words or so which is quite
frequent compared with batch-oriented systems such as SPITBOL.
The STR (String Handling) area would also be reduced in
SPITBOL because the IBM 360 is a byte-oriented machine with
certain built-in string operations. The result is that SPITBOL
should be more nearly balanced in its overall profile with
much of its time being spent in pattern matching, defined
functions, I/O and garbage collection. This, however, will
depend considerably on the application. MAINBOL has an
interpretive loop about twice as slow as SITBOL and has a much
slower garbage collection, pattern matcher and I/O. Since
overall program time goes up by more than a factor of 2, the
time spent in the interpreter for MAINBOL would actually
decrease (to say 25%). IO, GC, PL, PG and SMR times would
increase whereas other times would likely remain roughly the
same.


Program 11.1  RESOLUTION


To accumulate his own timing statistics, the programmer will
make calls on the built-in function TIME(). The value returned
is not uniformly increasing, but rather rises in steps which
are sometimes rather large. On many systems the step size,
called the resolution, is one-sixtieth of a second which is
fairly large as many things can happen during this time
period. It is essential to know or be able to compute this
resolution to obtain accurate timings. Fortunately, this is
rather easily done.


              DEFINE('RESOLUTION()T')          :(RESOLUTION_END)
*
*  Entry point: Initialize T to the current time. Then
*  repeatedly set RESOLUTION to the difference between the
*  current time and this initial time. When it goes positive,
*  the smallest resolution is obtained.
*
RESOLUTION    T = TIME()
RESOLUTION_1  RESOLUTION = TIME() - T
              GT(RESOLUTION,0)       :S(RETURN)F(RESOLUTION_1)
RESOLUTION_END


Epilogue 


Since TIME() returns an integer in milliseconds, it is 
possible that the resolution may be off by as much as a mil- 
lisecond. For example, on the IBM 370 Mod 165 the interval 
timer resolution is 3.3 and RESOLUTION returns 3 two-thirds of 
the time and 4 one-third of the time. In such cases, 
RESOLUTION could be modified to return a constant known value. 
But it should be remarked that only an approximate value for 
the resolution is ever needed. Exercise 11.6 explores another 
possibility for improving the behavior of RESOLUTION. 
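The same idea carries over to any system whose clock advances in steps. The following Python sketch mirrors RESOLUTION; the function name resolution and its clock parameter are choices made for this illustration, with time.monotonic from the standard library as the default clock.

```python
import time


def resolution(clock=time.monotonic):
    """Spin until the clock visibly advances and return the size of
    that first step, in whatever units the clock reports."""
    start = clock()
    while True:
        step = clock() - start
        if step > 0:
            return step
```

As with the SNOBOL4 version, the value returned is one observed step of the clock, so it is an approximation which may vary slightly from call to call.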


Program 11.2  TIMER


The timer routine shown below will time a statement (or
statements) passed to it as arguments. Thus


          TIMER(' A = B + C ')


will determine how much time is required to execute the given
assignment statement and will print appropriate statistics.
If more than one statement is to be timed they should be
separated by semicolons.


To time a statement it is placed in a loop and executed for
several times longer than the resolution of the clock. In
order to deduct the time required to increment a counter and
test, the loop is executed twice, once with the statement in
and once with it out.


          DEFINE('TIMER(S_,N_)C_,T_,I_')          :(TIMER_END)
*
*  Entry point: On first call, fall through. When TIMER is
*  called recursively, N_ is nonzero and control passes to
*  TIMER_N.
*
TIMER     EQ(N_,0)                                :F(TIMER_N)
*
*  Starting with 10 executions, double the number until the
*  difference between the times required to execute and not
*  execute the given statement is 20 ticks of the clock.
*
          N_ = 10
TIMER_1   T_ = TIMER(' ;' S_,N_) - TIMER(,N_)     :F(FRETURN)
          N_ = LT(T_,20 * RESOLUTION()) N_ * 2    :S(TIMER_1)
*
*  Now print the results.
*
          T_ = CONVERT(T_,'REAL')
          OUTPUT =
          OUTPUT = 'THE STATEMENT'
          OUTPUT = S_
          OUTPUT = 'REQUIRED ' (T_ / N_) ' MILLISECONDS +/- 10%'
+                  ' TO EXECUTE IN ' SYSTEM()     :(RETURN)
*
*  Here if N_ is nonzero. Prepare a string C_ which will be
*  compiled and executed and will contain the statement to be
*  measured together with a control loop.
*
TIMER_N   I_ = 1
          C_ = ' COLLECT() ; TIMER = TIME() ;'
+              'TIMER_3' S_ ';'
+              ' I_ = I_ + 1 LT(I_,' N_ ') :S(TIMER_3) ;'
+              ' TIMER = TIME() - TIMER :(RETURN)'
*
*  Compile the string and, if successful, execute it.
*
          C_ = CODE(C_)                           :S<C_>F(FRETURN)
TIMER_END


Names referenced by TIMER:

     Name         Type       Where defined
     SYSTEM       Function   Program 11.3
     RESOLUTION   Function   Program 11.1


Epilogue


Note that the temporaries and arguments are given 'funny'
names, i.e. ending with the underscore (_) character. This is
to avoid conflict with variables in the statement being timed.
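The strategy of TIMER (run the statement in a loop, run the empty loop, subtract, and double the count until the difference spans 20 ticks) is independent of SNOBOL4. A Python sketch of the same strategy follows; the function time_statement and its parameters are hypothetical, and a callable stands in for the compiled statement string.

```python
def time_statement(statement, clock, resolution, ticks=20):
    """Return the time per execution of `statement` (a callable),
    measured with `clock`, whose step size is `resolution`."""

    def run(body, n):
        start = clock()
        for _ in range(n):
            body()
        return clock() - start

    def empty():
        pass

    n = 10
    while True:
        # Time with the statement in, minus time with it out.
        net = run(statement, n) - run(empty, n)
        if net >= ticks * resolution:
            return net / n
        n *= 2      # not enough ticks yet: double and try again
```

With a net time of at least 20 ticks, an error of a tick at either end of each of the two measurements amounts to no more than a few ticks in twenty, which is in keeping with the +/- 10% printed by TIMER. A statement that takes no measurable time at all would keep this loop doubling indefinitely, so a practical version would want an upper limit on n.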



Program 11.3  SYSTEM


SYSTEM() is a function which will attempt to determine which
of the various SNOBOL4 processors it is running under. For
example, under SPITBOL, SYSTEM() will return 'SPITBOL'. The
function is not easy to write because if there is a difference
between any two processors this may be regarded as a
deficiency and may get fixed sometime in the future, rendering
the function we're about to write invalid.


One of the main differences between the various systems is in
functions and/or keywords implemented. Unhappily, one cannot
test directly for the existence of such functions or keywords
so knowing about such differences does us no good.


SYSTEM() was used to identify which implementation was being
measured by TIMER and is provided more for its intrinsic
interest than its necessity.


          DEFINE('SYSTEM()K')                     :(SYSTEM_END)
*
*  Entry point: First separate out MAINBOL from the other
*  processors. Only MAINBOL regards .X as a string.
*
SYSTEM    IDENT(DATATYPE(.X),'STRING')            :F(SYSTEM_2)
*
*  Falling through implies MAINBOL. Now separate out the
*  various systems on the basis of the SIZE of &ALPHABET. The
*  Honeywell 635 uses a 9-bit code. IBM equipment uses an
*  8-bit character while the PDP-10 uses 7-bit ASCII.
*
          K = SIZE(&ALPHABET)
          SYSTEM = EQ(K,512) 'HONEYWELL MAINBOL'  :S(RETURN)
          SYSTEM = EQ(K,256) 'IBM MAINBOL'        :S(SYSTEM_1)
          SYSTEM = EQ(K,128) 'PDP-10 MAINBOL'     :S(RETURN)
*
*  Both CDC and UNIVAC MAINBOL's use 6-bit codes. We can
*  distinguish between these two systems by the order of
*  characters in &ALPHABET. Only CDC contains () as adjacent
*  characters.
*
          SYSTEM = 'CDC MAINBOL'
          &ALPHABET '()'                          :S(SYSTEM_1)
          SYSTEM = 'UNIVAC MAINBOL'               :(RETURN)
*
*  Here to test if the system also contains blocks. The
*  operator sharp (#) will have a lower precedence than blank
*  if the blocks extension is available. If the value of T is
*  1 (5 + 5) then we're in pure MAINBOL. Otherwise we've got
*  blocks.
*
SYSTEM_1  OPSYN('OLD_SHARP','#',2)
          OPSYN('#','+',2)
          T = 1 5 # 5
          OPSYN('#','OLD_SHARP',2)
          EQ(T,110)                               :S(RETURN)
          SYSTEM = SYSTEM ' WITH BLOCKS'          :(RETURN)
*
*  Here if not MAINBOL. FASBOL has an unorthodox SUBSTR
*  function.
*
SYSTEM_2
          SYSTEM = DIFFER(SUBSTR('ABC',2,1),'B') 'FASBOL'
+                                                 :S(RETURN)
*
*  SITBOL, running on the PDP-10, can easily be distinguished
*  from the IBM SPITBOL by the size of &ALPHABET.
*
          SYSTEM = EQ(SIZE(&ALPHABET),128) 'SITBOL'  :S(RETURN)
          SYSTEM = 'SPITBOL'                      :(RETURN)
SYSTEM_END


Epilogue 


The above function is obviously incomplete as it does not 
include all machines for which MAINBOL has been expanded. If 
your favorite processor is not among the group you are 
encouraged to modify the program to include it. 


Anatomy of a SNOBOL4 Statement


In this section we will study the time requirements of SNOBOL4
statements. Such an analysis may at first blush seem rather
difficult because in a language as rich as SNOBOL4 there is
'so much going on'. But just the reverse is the case. For
example, Table 11.3 shows the times required to execute in
SPITBOL and MAINBOL a sequence of four statements in ascending
order of complexity. TIMER, Program 11.2, was used to time
these statements and is responsible for other similar timing
figures given in this section. All times in this section were
made on (or normalized to) an IBM 360 Mod 65. For possible
comparison with other processors, some representative
instruction times are given in Table 11.4.


In Table 11.3, we see that null statements (statements which
do nothing) consume relatively little time; i.e. statement
overhead is relatively small. Assignment is fairly fast
since, for all datatypes, it is merely a descriptor (two 
32-bit words) copy. But the most notable thing about Table 
11.3 is that there is a linear relationship of time with the 
number of arithmetic operators. 


This relationship is more nearly linear in an interpreter or
type-3 system because the various operations are 'packaged'
more so than in a type-4 compiler. In a type-4 system, code
optimization techniques render more interaction between
operations of the same expression so that the time of a
statement is not simply the sum of the times of the component
operations.


Table 11.3  Time in milliseconds required to execute a
sequence of arithmetic assignment statements.

     Statement              SPITBOL    MAINBOL
     ---------              -------    -------
     A                       .0012      .02
     A = I                   .004       .10
     A = I + J               .009       .30
     A = I + J + K           .015       .50
     A = I + J + K + L       .021       .70
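The linearity can be verified by taking differences between successive assignment rows of Table 11.3. The following Python lines do this; the list ROWS holds the table's figures as read here and the helper increments is a name made up for the example.

```python
# (number of + operators, SPITBOL ms, MAINBOL ms)
ROWS = [
    (0, .004, .10),   # A = I
    (1, .009, .30),   # A = I + J
    (2, .015, .50),   # A = I + J + K
    (3, .021, .70),   # A = I + J + K + L
]


def increments(column):
    """Extra time charged for each additional + operator."""
    values = [row[column] for row in ROWS]
    return [round(b - a, 3) for a, b in zip(values, values[1:])]


spitbol_steps = increments(1)   # close to .006 ms per operator
mainbol_steps = increments(2)   # .2 ms per operator
```

Each extra operator costs MAINBOL .2 milliseconds and SPITBOL between .005 and .006, so the time of a statement of this kind is well approximated by the assignment time plus a fixed charge per operator.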


Measuring the time of an operation which does not generate 
storage is fairly straightforward as the direct measurement by 
TIMER may be used. If the operation generates storage which 
must later be collected, an additional increment of time 
should be charged to such an operation. We will see later how
this can be done.


Arithmetic


Table 11.5 shows the time required for arithmetic operations.
In MAINBOL the time is dominated by overhead so that all
operations, even exponentiation, take pretty much the same
time (about .2 milliseconds). This even includes the case
where one of the operands must be converted to string or real.


Table 11.4  Selected instruction times for the IBM 360/65. (N
is the number of characters involved in a multiple-character
operation.)

     Operation                               Time
                                             (microseconds)
     ---------                               --------------
     Load (1 word)                            .95
     Store (1 word)                           .93
     Add (storage-to-register)               1.65
     Floating add (storage-to-register)      1.68
     Multiply (storage-to-reg.)              4.45
     Divide (storage into reg.)              9.00
     Compare (reg. with storage)             1.40
     Branch                                  1.10
     MVC (storage-to-storage move)           3 + .3N
     CLC (storage-to-storage compare)        2.9 + .3N
     TRT (SPAN & BREAK)                      4.1 + 1.2N
     TR (REPLACE)                            1.9 + 1.8N


In SPITBOL, as may be expected, the overhead has been reduced 
to the point where variations in the natural execution times 
do show up in the time for the overall operations. Thus, in- 
teger division (.019) is longer than integer multiplication 
(.014) which in turn is longer than addition (.007) which 
reflect differences in the absolute times to perform these 
instructions (.009, .005, and .001 respectively). 


Table 11.5  Time in milliseconds to carry out selected
arithmetic operations in SPITBOL and MAINBOL on the IBM
360/65.

     Data Type   Operation   Data Type     SPITBOL   MAINBOL
     ---------   ---------   ---------     -------   -------
     integer        +        integer        .007       .2
     integer        -        integer        .007       .2
     integer        *        integer        .014       .2
     integer        /        integer        .019       .2
     integer        **       integer        .039       .2
     integer      REMDR      integer        .035       .18
     integer        +        real           .061       .2
     real           +        integer        .067       .2
     real           +        real           .016       .2
     integer        +        string (2)     .084       .2


Table 11.5 shows a ratio of improvement of SPITBOL over 
MAINBOL which varies from about 25:1 in the case of integer 
arithmetic to about 2.5:1 in the case of addition with one ar- 
gument a string. This is because, in the latter case, the time 
is dominated by the conversion, and this MAINBOL does within a 
single macro, so that the SPITBOL approach grants no 
advantage. 


Flow of Control


Various operations associated with flow of control are given
in Table 11.6. These figures should be sufficient to predict
the time of simple looping control instructions.


For example, the standard method of implementing a loop in 
SNOBOL4 is some variant of 


          N = 0
LOOP      N = N + 1 LT(N,100)                     :F(LOOP_OUT)
                                                  :(LOOP)
LOOP_OUT


which will execute the inner part of the loop 100 times. The
statement labeled LOOP will be executed 100 times before
failing. Predicates such as LT() will return the null string
when they succeed as this is the least flagrant value they can
return. Concatenation treats null as a special case simply
returning the other value and hence is very fast.


Table 11.6  Time in milliseconds of flow-of-control type
operations for SPITBOL and MAINBOL.

     Operation                      SPITBOL       MAINBOL
     ---------                      -------       -------
     GT, LT, EQ, LE, GE, NE          .02           .2
     IDENT, DIFFER                   .02           .2
     LGT                             .05           .35
     Null Concatenation              .02           .2
     Label Goto                      .027          .17
     Code Goto                       .037          .20
     Function call (N =
       # of args and temps)          .09+.012N     .40+.03N


The time to execute the statement labeled LOOP can be obtained 
by adding the times for assignment, addition, LT() and null 
concatenation which yields .70 for MAINBOL and .051 for 
SPITBOL. To this should be added the time to execute a label
goto which brings the total control overhead to .87 and .078 
milliseconds respectively. 
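The arithmetic behind these totals is a straight sum of entries from Tables 11.3, 11.5 and 11.6. A Python rendering of the calculation follows; the dictionary COST and the helper total are names introduced for this example.

```python
# operation: (SPITBOL ms, MAINBOL ms)
COST = {
    "assignment":         (.004, .10),   # Table 11.3, A = I
    "addition":           (.007, .2),    # Table 11.5, integer + integer
    "LT":                 (.02,  .2),    # Table 11.6
    "null concatenation": (.02,  .2),    # Table 11.6
    "label goto":         (.027, .17),   # Table 11.6
}
COLUMN = {"SPITBOL": 0, "MAINBOL": 1}

LOOP_STATEMENT = ["assignment", "addition", "LT", "null concatenation"]


def total(operations, system):
    """Sum the times of the component operations, in milliseconds."""
    return round(sum(COST[op][COLUMN[system]] for op in operations), 3)


statement_time = {s: total(LOOP_STATEMENT, s) for s in COLUMN}
overhead = {s: total(LOOP_STATEMENT + ["label goto"], s) for s in COLUMN}
```

The statement times come to .051 and .70, and the totals with the goto to .078 and .87, so a loop of 100 iterations carries roughly 8 milliseconds of control overhead in SPITBOL and 87 in MAINBOL.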


The time to execute a goto is influenced slightly by whether
it's a fail goto or a success goto, and the actual
configuration of the goto portion of the statement. The figure
given in Table 11.6 is simply an estimate usable mainly
because the transfer of control consumes, normally, a very
small portion of the total time. The total time required by a
function is found by adding the function overhead time, given
in Table 11.6, to the time required to execute the function's
statements. The time of a RETURN (or FRETURN) is absorbed in
the function overhead.
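The function overhead formulas of Table 11.6 can be applied directly. In the Python sketch below the function call_overhead is a hypothetical helper; the coefficients are those of the table, with N the number of arguments and temporaries.

```python
# system: (fixed ms, ms per argument or temporary), from Table 11.6
CALL_COST = {"SPITBOL": (.09, .012), "MAINBOL": (.40, .03)}


def call_overhead(n, system):
    """Overhead in milliseconds of one call to a defined function
    having n arguments and temporaries in all."""
    fixed, per_name = CALL_COST[system]
    return fixed + per_name * n


def function_time(n, body_ms, system):
    """Total time: call overhead plus the time of the statements."""
    return call_overhead(n, system) + body_ms
```

For a function declared with two arguments and three temporaries, such as TIMER, N is 5 and the overhead works out to .15 milliseconds in SPITBOL and .55 in MAINBOL, before any of its statements are executed.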


Miscellany


Table 11.7 contains a miscellaneous collection of times for a
number of different operations. Some of the operations
generate storage which will lengthen subsequent garbage
collections but the times given do not reflect this cost (see
the Epilogue of TIMEGC, Prog. 11.4). It is interesting to note
that with the indirect reference (unary $) the times required
by SPITBOL and MAINBOL are almost the same. Because MAINBOL
hashes all data strings it does not have to hash for indirect
reference. SPITBOL does, but the hashing does not take as long
as MAINBOL's interpretive loop.


Table 11.7  Timings of miscellaneous operations. N, where
indicated, is the number of characters involved in the
operation. Times do not include garbage collection overhead.

     Operation                    SPITBOL        MAINBOL
     ---------                    -------        -------
     Concatenation                .05+.0005N     .35+.0005N
     SIZE                         .023           .13
     DUPL (of a single char)      .045+.0003N    .6+.027N
     $ (indirect reference)       .09            .12
     PROTOTYPE                    .016           .13
     A<I>                         .03            .30
     A<I,J>                       .07            .45
     ARRAY(N)                     .06+.03N       .7+.03N
     CODE(' X = Y + Z :(LA)')     1.53           3.7
     EVAL('LGT(S1,S2)')           1.2            3.1


Pattern Matching


The execution of a pattern matching statement consists of five
distinct parts: subject evaluation, pattern evaluation
(pattern building), pattern matching proper (scanning), object
evaluation and replacement. Not all of these operations need
be present. The time to execute such a statement is the sum of
the times of its component parts. The subject and object
evaluation are in the same category as ordinary expression
evaluation. The replacement operation is approximately
equivalent in time to two concatenations and is given in Table
11.10.


The time required to build a pattern is, to a first 
approximation, proportional to its size. Table 11.8 contains 
some representative times for the construction of patterns. 
Variables A, B and AB are used rather than constants 'A', 'B' 
and 'AB' because SPITBOL precomputes any constant-valued 
expression such as 'A' | 'B'. As indicated in the table, the 
time is measured in the absence of garbage collection. As we 
will see, garbage collection will approximately double this 
figure. 


   Table 11.8 indicates timings (in milliseconds) of selected 
   pattern-building operations. Times do not include that 
   attributable to garbage collection. 

                                                       No. of 
   Pattern expression        SPITBOL     MAINBOL       Primitives 
   ----------------------------------------------------------------
   A | B                     .167        .80           2 
   (A | B) . X               .466        1.1           4 
   (A | B) . X (A | B) . Y   1.16        [illegible]   8 
   BREAK(A)                  .07         .36           1 
   BREAK(AB)                 .12         .36           1 
   BREAK(AB) . X             .41         .93           4 
   BREAK(AB) . X LEN(1)      .57         1.78          5 

   where: A = 'A', B = 'B', AB = 'AB' 


To a first approximation the time required for pattern 
matching proper (scanning) is some fixed overhead, given by 
Table 11.10, plus the total attributable to individual 
primitive matches (and failures) as given by Table 11.9. Thus 
the pattern match below 


          S = DUPL('A', 100) 
          S ('A' | 'B' | 'C') 


will have approximately 3N primitive matches (here N = 100): 
N successful matches by 'A', and N failures each by 'B' and 
'C'. Table 11.9 indicates that SPITBOL requires .04 
milliseconds per string primitive, resulting in a total time 
of 12 milliseconds plus overhead. 
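
The arithmetic of this estimate can be sketched in a few lines. 
The sketch is written in Python rather than SNOBOL4 purely so 
that it can be run anywhere; the function name scan_estimate 
and its parameters are assumptions introduced for illustration, 
and the constants are the SPITBOL figures read from Tables 11.9 
and 11.10. 

```python
# Illustrative sketch only: estimate scanning time as a fixed overhead
# plus (number of primitive matches) x (time per primitive).
# scan_estimate is a hypothetical helper, not part of any SNOBOL4 system.

def scan_estimate(primitives, per_primitive_ms, overhead_ms):
    """Return the estimated scanning time in milliseconds."""
    return overhead_ms + primitives * per_primitive_ms

N = 100                 # length of the subject string
primitives = 3 * N      # N successes by 'A', N failures each by 'B' and 'C'
STRING_MS = 0.040       # SPITBOL string primitive (Table 11.9)
OVERHEAD_MS = 0.09      # SPITBOL matching overhead (Table 11.10)

total = scan_estimate(primitives, STRING_MS, OVERHEAD_MS)
print(round(total, 2))  # 12 msec of primitive matching plus the overhead
```

The same helper can be fed the MAINBOL constants to compare the 
two systems on a given statement. 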


   Table 11.9  Primitive matching time in milliseconds for 
   selected primitives. N indicates the number of characters 
   matched for multi-character operations. 

   Primitive          SPITBOL           MAINBOL 
   ------------------------------------------------
   String             .040              .18 
   RPOS(N)            .020              .20 
   LEN(N)             .020              .20 
   POS(N)             .020              .20 
   NOTANY(S)          .028              .24 
   NOTANY(*S)         .071              .42 
   SPAN(S)            .040+.0014N       .25+.0014N 
   BREAK(S)           .040+.0014N       .25+.0014N 


   Table 11.10  Other miscellaneous timings associated with 
   pattern matching. Times are in milliseconds and are 
   approximate. 

   Operation                    SPITBOL          MAINBOL 
   ---------------------------------------------------------
   Matching Overhead            .09              .5 
   Replacement                  .082+.0005N      .42+.0005N 
   Pure String Scanning Rate    .0014            .04 
     (per character) 
   ARBNO, per iteration         .010             .26 
   GBAL                         .043+.017N       .22+.033N 

The reader is cautioned that this analysis is approximate. The 
time required to scan (P1 | P2) will be less than the sum of 
the separate scanning times. Also, the time for a failure will 
be slightly different from that for a success. If differences 
on the order of 20% or so are significant the reader is urged 
to make his own timing tests of time-critical statements. 


The reader should also note that pattern matching heuristics 
play a significant role in affecting the overall time. Thus 
the pattern 


          POS(143) 'CAT' 


will result in two primitive matches in SITBOL and SPITBOL 
because of the POS heuristic (see Chapter 7) but will require 
145 primitive matches in MAINBOL (assuming the subject is long 
enough). Also, the futility heuristic can greatly reduce the 
number of primitives matched. 


When the pattern is a simple string, SPITBOL and MAINBOL treat 
it as a special case resulting in a faster scan as indicated 
in Table 11.10. If ARBNO appears in a pattern, then to the 
time required for all primitive matchings must be added the 
sum of all ARBNO extents multiplied by the weighting factor 
given in Table 11.10. BAL, as indicated in Chapter 7, is 
implemented by the repeated use of a primitive GBAL which 
matches the shortest nontrivial balanced string. Thus BAL will 
match the string '(XXXX)' with one application of the 
primitive GBAL and will match 'XXXXXX' with 6 applications of 
GBAL. Hence it requires much less time to match the former 
than it does the latter. For example, in MAINBOL, it requires 
.22 + (.033)(6) MSEC. to match '(XXXX)' whereas it requires 
(.22)(6) MSEC. to match 'XXXXXX'. 
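
The BAL comparison can be worked through numerically. The 
following Python sketch (Python is used only for convenience; 
the names gbal_ms and bal_ms are hypothetical) models BAL as 
one GBAL application per balanced piece, using the MAINBOL 
GBAL figure of .22 + .033N msec from Table 11.10. Unlike the 
rounded figure of (.22)(6) quoted in the text, the sketch also 
charges the small per-character term for each one-character 
piece. 

```python
# Illustrative sketch only: cost of BAL modeled as repeated GBAL calls.
# gbal_ms and bal_ms are hypothetical names; the constants are the
# MAINBOL GBAL figures (.22 + .033N msec) from Table 11.10.

def gbal_ms(n, fixed=0.22, per_char=0.033):
    """Time for one GBAL application that matches n characters."""
    return fixed + per_char * n

def bal_ms(piece_lengths):
    """Total time when BAL needs one GBAL application per piece."""
    return sum(gbal_ms(n) for n in piece_lengths)

parenthesized = bal_ms([6])     # '(XXXX)': one application, 6 characters
plain = bal_ms([1] * 6)         # 'XXXXXX': six applications, 1 character each

print(round(parenthesized, 3), round(plain, 3))
```

Under these assumptions the unparenthesized string costs 
between three and four times as much as the parenthesized one. 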


I/O Timing  When INPUT is mentioned in the source program, a 
line is read. How long does it take? This has no easy answer. 
Clearly different devices require different times. Even if we 
restrict our attention to one device, such as the disk, the 
issue is compounded by a host of factors. As a rough rule of 
thumb the total time required to move the arm of a disk drive 
into position (seek time) and wait for the information to come 
under the read heads (latency) plus the amount of time to 
actually read is, to grossly simplify, on the order of 100 
milliseconds. This figure is not normally charged directly to 
the user since the operating system can direct the cpu to do 
other things during the interim. This represents an 
extraordinarily complex situation not made less so by a 
variety of charging algorithms and scheduling philosophies. A 
rule of thumb is that the effective cost is equivalent to half 
the elapsed time. Hence, for disk, one may assume 50 
milliseconds per transmission. Since the time of transmission 
is relatively independent of the amount transmitted it pays to 
transmit more than one line at a time. Hence, lines are 
transmitted in what is called a block. The number of lines per 
block is called the blocking factor. A typical blocking factor 
for efficient disk I/O is on the order of 100, which converts 
the effective transmission time to .5 milliseconds per line. 


To this we must add the processing time to extract a given 
line from a buffer. This again will require rule of thumb 
estimates. In MAINBOL a rather slow Fortran conversion routine 
causes an I/O operation to require 5 milliseconds per line 
(IBM 360 Mod 65). Hence if the file is properly blocked, I/O 
times are dominated by this figure. In SPITBOL, Fortran I/O 
is sidestepped and the required processing takes about half a 
millisecond. Hence, in SPITBOL, an I/O reference requires a 
total of approximately one millisecond. 
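
The rule of thumb above reduces to a one-line formula: the 
effective transmission time divided by the blocking factor, 
plus the per-line processing time. A minimal Python sketch 
follows (Python is used only for illustration; io_ms_per_line 
is a hypothetical name, and the constants are the rough figures 
quoted in the text). 

```python
# Illustrative sketch only: effective I/O cost per line.
# io_ms_per_line is a hypothetical helper; the figures are the
# rule-of-thumb values quoted in the text, not measurements.

def io_ms_per_line(transmission_ms, blocking_factor, processing_ms):
    """Amortized transmission time per line plus buffer processing time."""
    return transmission_ms / blocking_factor + processing_ms

spitbol = io_ms_per_line(50, 100, 0.5)    # blocked file, SPITBOL processing
mainbol = io_ms_per_line(50, 100, 5.0)    # blocked file, MAINBOL processing
unblocked = io_ms_per_line(50, 1, 0.5)    # one line per block

print(spitbol, mainbol, unblocked)
```

The unblocked case shows why blocking matters: with one line 
per block the transmission term swamps the processing term. 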


Program 11.4  TIMEGC 

The following program will permit the caller to time a 
'typical' garbage collect. Strings, array elements and 
programmer-defined datatypes are strewn about in rather 
chaotic fashion and a call is made to clean some of it up. An 
argument to TIMEGC can be given which will alter the amount 
and somewhat the type of litter. The caller may experiment 
with other values of this number as well as with different 
kinds of allocation to see if the garbage collect time 
varies significantly. 


          DEFINE('TIMEGC(N)I,S,A,L,T,K,FREED') 
          DATA('LINK(VALUE,NEXT)')                    :(TIMEGC_END) 
* 
*  Entry point and top of loop. Free everything and issue a 
*  garbage collect. 
* 
TIMEGC    I =  ;  S =  ;  A =  ;  L = 
          COLLECT() 
          N = IDENT(N) 25 
          A = ARRAY(N) 
* 
*  Allocation loop: For each I from 1 through N allocate 
*  approximately one length-80 string, assign a length I string 
*  to A<I> and add one element to the linked-list L. 
* 
TIMEGC_1  I = I + 1 
          $I = DUPL(' ',78) I 
          A<I> = DUPL('*',I) 
          L = LINK(NULL,L) 
          GE(I,N)                                     :F(TIMEGC_1) 
* 
*  Determine the storage remaining. Then loosen about half of 
*  it and issue a garbage collect. Determine how much was 
*  collected and how long it took to make the collection. 
* 
          STREM = COLLECT() 
TIMEGC_2  $I =  ;  A<I> =  ;  L = NEXT(L) 
          I = I - 2 GT(I,2)                           :S(TIMEGC_2) 
          T = TIME() 
          FREED = FREED + (COLLECT() - STREM) 
          TIMEGC = TIMEGC + (TIME() - T) 
          K = K + 1 
* 
*  If not significantly more than the resolution of the 
*  clock, go back for more. Otherwise produce some 
*  statistics. 
* 
          LT(TIMEGC,50 * RESOLUTION())                :S(TIMEGC) 
          OUTPUT = 
          OUTPUT = 'IN ' SYSTEM() ' ' K ' GARBAGE COLLECTS' 
+            ' REQUIRED A TOTAL OF ' TIMEGC ' MILLISECONDS TO FREE ' 
+            FREED ' STORAGE UNITS.' 
          TIMEGC = CONVERT(TIMEGC,'REAL') 
          OUTPUT = 'THIS AVERAGES TO ' (TIMEGC / K) ' MSEC. PER' 
+            ' GARBAGE COLLECT AND ' (TIMEGC / FREED) ' MSEC. PER' 
+            ' STORAGE UNIT.'                         :(RETURN) 
TIMEGC_END 


Names referenced     Name          Type        Where defined 
by TIMEGC:           RESOLUTION    Function    Program 11.1 


Epilogue 


TIMEGC(N) was called for various values of N and the results 
are given in Table 11.11. 


   Table 11.11  Data obtained by calling TIMEGC with a variety 
   of arguments. 

           SPITBOL                       MAINBOL 
           Ave GC   Storage  Time        Ave GC   Storage  Time 
           Time     coll.    per byte    Time     coll.    per byte 
     N     (msec)   per GC   (microsec)  (msec)   per GC   (microsec) 
   --------------------------------------------------------------------
     50      17      3.4K     5.0          98      5.8K    17.0 
    100      27      8.1K     3.3         105     13.5K     8.9 
    150      41     14.0K     2.9         144     21.6K     6.7 
    200      51     21.3K     2.4         196     31.5K     6.3 
    250      77     30.0K     2.6         220     42.6K     5.2 
    300     104     39.4K     2.6         224     55.0K    [illegible] 
    350     138     50.0K     2.8         256     68.4K     3.9 
    400     183     62.4K     2.9         304     83.3K     3.6 
    450     210     76.0K     2.8         343    100 K      3.5 


As might be expected, the time to garbage collect is a 
function of how many allocated objects are lying about in 
core. For small collections, SPITBOL has a clear advantage 
over MAINBOL; but this advantage curiously diminishes as the 
collections become larger. (This anomaly has yet to be 
explained.) Also, as collections get larger, the time required 
per byte collected seems to converge to about three 
microseconds. This figure is not absolute since garbage 
collections in which very little storage, as a fraction of the 
whole, is retrieved can require much more than this. 
Nevertheless, it serves as a useful rule of thumb for 
estimating the garbage collection overhead attributable to an 
operation that allocates storage. For example, Table 11.7 
indicates the time for concatenation to be .05+.0005N 
milliseconds in SPITBOL. To this we must add a factor 
attributable to later garbage collection. In SPITBOL, a string 
requires 6 + N bytes of storage as indicated in Table 11.12. 
Using a figure of 3 microseconds (.003 milliseconds) per byte, 
the garbage collection charge is .018 + .003N milliseconds, so 
the real cost of concatenation is .068 + .0035N milliseconds. 
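
The adjustment just described can be checked mechanically. The 
Python sketch below (Python is used only for illustration; all 
function names are hypothetical) adds the deferred garbage 
collection charge to the direct cost of a SPITBOL 
concatenation, using the figures quoted in the text. 

```python
# Illustrative sketch only: direct cost of a SPITBOL concatenation plus
# the later garbage collection charge for the storage it allocates.
# All names are hypothetical; constants are those quoted in the text.

GC_MS_PER_BYTE = 0.003          # 3 microseconds per byte collected

def concat_ms(n):
    """Direct concatenation time for an n-character result (Table 11.7)."""
    return 0.05 + 0.0005 * n

def string_bytes(n):
    """Storage for an n-character SPITBOL string (Table 11.12)."""
    return 6 + n

def real_concat_ms(n):
    """Direct time plus the deferred garbage collection charge."""
    return concat_ms(n) + GC_MS_PER_BYTE * string_bytes(n)

for n in (0, 100, 1000):
    # Agrees with the closed form .068 + .0035N derived in the text.
    print(n, round(real_concat_ms(n), 4), round(0.068 + 0.0035 * n, 4))
```

For long strings the garbage collection term dominates: at 
N = 1000 it accounts for roughly six sevenths of the total. 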


   Table 11.12 shows the amount of storage (in bytes) required 
   for each of several datatypes. 

   Datatype                               SPITBOL         MAINBOL 
   ------------------------------------------------------------------
   String (N is no. of chars.)            6 + N           32 + N 

   Variable (N is number                  38 + N          32 + N 
   of characters in name) 

   Patterns (N is no. of primitives,      16 + 16N +      8 + 32N 
   A is no. of ANY, NOTANY's,             32A + 256B 
   B is no. of BREAK & SPAN's*, 
   figure is approximate) 

   Arrays (N is no. of elements and       20+8N+8D        16+8N+16D 
   D is no. of dimensions) 

   Prog. Defined Data Object              8 + 8N          8 + 8N 
   (N is no. of fields) 

   Table (E is no. of items in            12+24E+4I       8+16E 
   the table and I is the initial 
   first argument to the TABLE 
   function) 

   * If the argument to BREAK or SPAN is only one character, 
     no additional storage is required (B is 0). 

The Inner Loop 

It is characteristic of many programs that approximately 90% 
of the time is spent in 10% of the program. This is true of 
SNOBOL4 itself and it tends to be true of programs written in 
the language. Whether or not the topology of the program 
merits the epithet, the point or points within the program 
where most of the time is spent is called the 'inner loop'. 
While the SITBOL system has an automatic method for 
determining which statements are responsible for the most 
time, most SNOBOL4 systems do not. There do exist, however, 
certain tracing tools which may be used to examine a program's 
behaviour and extract at least approximate timing information. 


Program 11.5  LPROG 

LPROG() will return the length (i.e. the number of statements) 
of the SNOBOL4 program in which it is called. LPROG will 
actually cause one more statement to be compiled at run-time 
so that its repeated use will return slightly different 
values. If new code is compiled in the interim, the value 
returned by LPROG will be augmented by the number of new 
statements. 


          DEFINE('LPROG()')                           :(LPROG_END) 
* 
*  Entry point: Compile a statement and return 1 less than 
*  its statement number. 
* 
LPROG                         :<CODE(' LPROG = &STNO - 1 :(RETURN)')> 
LPROG_END 


Epilogue 


LPROG has intrinsic interest of its own as well as being a 
useful, if not essential, tool in constructing an array to 
record a program's profile (as we shall see). 


Program 11.6  FPROFILE 

FPROFILE is a program which determines the number of times 
each statement is executed in the program in which it is 
embedded. This is called the frequency profile of the program. 
The statistics gathering begins when the initialization 
section of FPROFILE is executed and tracing is turned on. 
Hence FPROFILE is normally placed before the program to be 
monitored but must be placed after the LPROG function which it 
calls during initialization. For each statement executed after 
tracing has been established, FPROFILE is called and a 
tabulation is made in an array (FP_ARY). At any given time 
during the course of execution, statement number N will have 
been executed FP_ARY<N> times. 


          DEFINE('FPROFILE()') 
* 
*  Allocate an array to gather statistics and set up tracing 
*  on the keyword &STCOUNT. 
* 
          FP_ARY = ARRAY(LPROG()) 
          TRACE(.STCOUNT,'KEYWORD',,'FPROFILE') 
          &TRACE = 1000000                            :(FPROFILE_END) 
* 
*  Entry point of FPROFILE (called at each executable 
*  statement). 
* 
FPROFILE  FP_ARY<&LASTNO> = FP_ARY<&LASTNO> + 1       :(RETURN) 
FPROFILE_END 


Names referenced     Name       Type        Where defined 
by FPROFILE:         LPROG *    Function    Program 11.5 


* indicates name is referenced in the initialization section. 


Program 11.7  TPROFILE 

A time profile of a program indicates the relative time spent 
in each statement. In a language like SNOBOL4, where there is 
a relatively high variation in the time required to execute 
any given statement, a time profile is much more desirable 
than a frequency profile. 


TPROFILE, a modification of FPROFILE, allocates to the 
statement just executed the difference between the current 
time and the last previous time. Unhappily, the time required 
to gather the statistic may be as large as or even larger than 
the time being measured. However it is likely to be a more 
valuable indicator than FPROFILE and in many cases can give a 
surprisingly accurate time profile. 


          DEFINE('TPROFILE()S,T') 
* 
*  Set up tracing. Times are tabulated in TP_ARY. TPROFILE 
*  will be called at the start of each statement to be 
*  executed. 
* 
          TP_ARY = ARRAY(LPROG()) 
          TRACE(.STCOUNT,'KEYWORD',,'TPROFILE') 
          &TRACE = 1000000                            :(TPROFILE_END) 
* 
*  Entry point: Save the statement number (S) of the statement 
*  about to be executed and quickly obtain the time (T). 
*  Augment TP_ARY according to the last interrupted statement. 
* 
TPROFILE  S = &LASTNO 
          T = TIME() 
          TP_ARY<LAST_STNO> = TP_ARY<LAST_STNO> + T - LAST_TIME 
          LAST_STNO = S 
          LAST_TIME = TIME()                          :(RETURN) 
TPROFILE_END 


Names referenced     Name       Type        Where defined 
by TPROFILE:         LPROG *    Function    Program 11.5 


* indicates name is referenced in the initialization section. 


Epilogue 


To test the two profiling programs, the function BNORM (Prog. 
10.1) was used. It was passed a string of approximately 120 
characters containing 10 BSPACEs and two USCOREs. To average 
out noise effects, BNORM was called 250 times. The results of 
applying FPROFILE and TPROFILE to the program are shown in 
Figure 11.3. 


The data was collected on the SITBOL system so that a 
comparison could be made with a 'true' time profile as 
provided by a built-in facility. Figure 11.4 shows the results 
of turning on the built-in profiler. As might be expected, 
the times reported by TPROFILE are a little higher than the 
true times since each statement executed is charged with a 
little of the overhead used to gather the statistic. But the 
results are surprisingly close due to the relatively small 
amount of time required to execute a simple assignment 
statement. 


For running TPROFILE on SPITBOL it is imperative to obtain the 
TIME() before &LASTNO because the latter represents a 
relatively slow operation. Exercise 11.11 provides a method of 
doing this. 
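
The bookkeeping performed by FPROFILE and TPROFILE can be 
illustrated apart from the SNOBOL4 trace mechanism. The Python 
sketch below is a simplified model, not a translation: it 
assumes a hypothetical list of (statement number, clock 
reading) pairs, one per statement about to be executed, and 
the name profiles is invented for the illustration. Each 
elapsed interval is charged to the statement that was 
interrupted by the previous call, as TPROFILE does. 

```python
# Illustrative sketch only: the tabulation done by FPROFILE and TPROFILE,
# applied to a hypothetical list of (statement_number, clock_ms) events,
# one event for each statement about to be executed.
from collections import defaultdict

def profiles(events):
    """Return (frequency profile, time profile) as dictionaries."""
    freq = defaultdict(int)
    spent = defaultdict(int)
    last_stno = None
    last_time = None
    for stno, clock in events:
        freq[stno] += 1                      # FPROFILE: count the statement
        if last_stno is not None:
            # TPROFILE: charge the elapsed time to the previous statement
            spent[last_stno] += clock - last_time
        last_stno, last_time = stno, clock
    return dict(freq), dict(spent)

events = [(1, 0), (2, 5), (3, 6), (2, 20), (3, 21), (4, 40)]
freq, spent = profiles(events)
print(freq)
print(spent)
```

In this made-up trace statements 2 and 3 are executed equally 
often, yet statement 3 accounts for nearly all of the time, 
which is exactly the distinction a time profile is meant to 
expose. 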


EXERCISES 

Exercise 11.1  Which of the following linguistic facilities 
require a run-time symbol table? 

(a) Pattern matching 
(b) A sort facility 
(c) Run-time compilation 
(d) Redefinition of functions 
(e) Go to a label whose name is computed 
(f) Call a function whose name is computed 
(g) Linked-list operations 


Figure 11.3  The result of applying FPROFILE (above) and 
TPROFILE (below) to 250 calls to the BNORM function. The 
numbers below the bars refer to statement numbers in BNORM. 
Times are in seconds. 


Figure 11.4  The histogram above shows the 'true' time profile 
of the program run to produce the histograms in Figure 11.3. 
Times are given in seconds. 


Exercise 11.2  Each method below for computing hash numbers 
has at least one flaw. Indicate whether it is too 
time-consuming (T), does not provide a good spread (S) or is 
not repeatable (R). More than one letter might be applicable. 
Assume each character is an 8-bit code which represents some 
integer between 0 and 255. L is the length of the Hash Array. 

(a) Multiply all the characters together ignoring overflows. 
    Then divide by L and use the remainder. 

(b) Divide the size of the string by L and use the remainder. 

(c) Let L be 256 and choose the first character as the hash 
    number. 

(d) Let L be 256 and Exclusive-OR all the characters together. 

(e) Add the size of the string to the last previous hash 
    number and divide by L, using the remainder. 

(f) Use the machine address of the first character of the 
    string. 

Exercise 11.3  As indicated in the text, compilers can be 
ranked from Type 0 to Type 4. Each increase in compilation 
complexity brings about a decrease in run-time flexibility. 
What type of compiler is required to implement each of the 
following language features in a reasonably straightforward 
way? For example, if your answer is Type 2, then all compilers 
of Type 2 and lower should have no special difficulty 
implementing the feature. By Type 3 assume that the decision 
to push a value or a pointer to a variable is made at compile 
time. 

(a) Run-time modification of operator precedence 
(b) A sort function 
(c) Redefinition of SNOBOL4 functions 
(d) Redefinition of SNOBOL4 operators 
(e) Run-time modification of the meanings of characters 
    (e.g., hereinafter R is an operator) 
(f) Declarationless variables 
(g) Recursive functions 
(h) Run-time trace requests on variables 
(i) Run-time macros (hereafter all strings in the text of the 
    program of the form X shall be regarded as string Y) 


Exercise 11.4  Which of the following facilities are more 
likely to be associated with a floating form of storage 
management and which with fixed storage? 

(a) Declaring a variable to be string and giving it a maximum 
    length. 
(b) Arrays containing arbitrary and mixed datatypes. 
(c) Garbage collection. 
(d) Functions which return arrays. 
(e) String assignment implemented via copying. 


Exercise 11.5  Give an example of a statement which, if timed 
using TIMER, would result in an infinite loop. 


Exercise 11.6  Modify RESOLUTION (Prog. 11.1) so that it 
averages ten attempts to obtain the resolution. Make sure the 
computation is done once and not at each call. 

Exercise 11.7  One can define the factorial of n (normally 
written n!) as follows: 


          DEFINE('F(N)')                              :(F_END) 
F         F = LE(N,1) 1                               :S(RETURN) 
          F = N * F(N - 1)                            :(RETURN) 
F_END 


Estimate the time required (in SPITBOL) to compute F(1), F(2) 
and F(n) for arbitrary n. Compare the time required for this 
recursive program with the following iterative version of the 
factorial function. 


          DEFINE('F(N)')                              :(F_END) 
F         F = 1 
F_1       F = GT(N,1) F * N                           :F(RETURN) 
          N = N - 1                                   :(F_1) 
F_END 

Exercise 11.8  You are writing a pre-processor in SNOBOL4 
which will examine each line of a source statement for the 
occurrence of a special character (say %). If the special 
character is there, the program will do something interesting. 
Otherwise it copies the line intact. Write an 'inner loop' 
that does nothing but read and write and check for the 
existence of the special character. Assuming the lines 
containing the special character are relatively rare, the 
speed of processing approximates the speed of the inner loop. 
Compute the speed of your pre-processor in statements per 
minute operating in SPITBOL. Assume I/O time is one 
millisecond per line. 


Exercise 11.9  Since error and trace messages are given in 
terms of SNOBOL4 statement numbers it is helpful to have a 
method of producing such numbers for statements compiled via 
the CODE function. Redefine the CODE function in an upward 
compatible way so that in addition to compiling code it sets 
the global variable CODENO to the number of the statement (or 
first statement of a sequence) being compiled. (Hint: Look at 
the LPROG function and use the fact that SNOBOL4 assigns 
statement numbers sequentially without breaks. Only two 
statements are required in the body of the function.) 


Exercise 11.10  Modify LPROG (Prog. 11.5) so that it will 
always return the value it returned when it was first called. 
(Hint: This can be done by the insertion of 5 characters.) 


Exercise 11.11  TPROFILE (Prog. 11.7) attempts to obtain the 
TIME() as quickly as possible but is torn by the fact that the 
first statement executed must capture &LASTNO. Suggest how 
TPROFILE can be improved so that the TIME() is captured as 
quickly as possible in the first statement without losing the 
value of &LASTNO. 

There are n! ways of rearranging (or permuting) n objects and 
these are referred to as permutations. For example, there are 
3! (=6) ways of permuting the 3 characters of the string 'ABC' 
as follows: 

ABC 
ACB 
BAC 
BCA 
CAB 
CBA 


There is a body of literature on the subject of permutations 
[Algorithms, 1968, p. 829] owing, perhaps, more to the value 
of studying permutations as a computational exercise than to 
strictly utilitarian reasons. Yet, the study of techniques 
employed to solve this problem is undoubtedly useful in 
discovering techniques for solving more practical problems. 


Permutation routines are subject to a variety of different 
ground rules. The object to be permuted may be an array, a 
list or a string. The array may be an array of integers 
{1,2,...,n} or an arbitrary array. The permutation may be 
lexicographic; in the case of strings this would imply that 
the permutations are produced in alphabetic order. In general, 
if the objects to be permuted can be compared relative to each 
other ('well-ordered' in mathematical parlance) a 
lexicographic order is defined on the permutations, and some 
algorithms are constrained to produce the permutations in this 
order. Sometimes the objects to be permuted contain duplicates 
such as the characters of 'MISSISSIPPI' and the permutation 
program is required to produce only those permutations which 
are truly distinct. These are sometimes known as "permutations 
with repetitions" or, as we will call them, reorderings. 
Finally, the permutation wanted may be a purely random one and 
the algorithm for doing that is included in the section on 
Stochastic Strings. 


PERMUTATION RECORDS 

We will speak in this section of permuting n+1 objects. This 
may seem more awkward than speaking of permuting n objects but 
it will have the advantage of making our notation simpler. The 
number of permutations of n+1 objects is (n+1)! and the 
reasoning is as follows. Assume that the objects are selected 
one at a time in an arbitrary sequence to be placed in some 
permutation. The first object drawn can be placed in only one 
way. The second object drawn can be placed to the left or the 
right of the first object; the 3rd object can be placed to the 
left, between, or to the right of the previous 2 objects. In 
general, the ith object can be placed in any of i different 
positions and a little reflection will reveal that each 
position will lead to a different permutation. Moreover, every 
permutation can be obtained by this means. Hence, the total 
number of permutations can be obtained by multiplying together 
the numbers of choices available at each step, 
1 x 2 x ... x (n+1), which yields the result (n+1)!. 


This argument suggests the notion of a permutation record, 
which is important computationally because most algorithms 
depend on some form of this record to record past history. Let 


          i_1, i_2, ..., i_n 

be a sequence of integers obeying the following inequalities 


          0 <= i_1 <= 1 
          0 <= i_2 <= 2 
              ... 
          0 <= i_n <= n 


For example: 

          1 0 2 4 2 


is a permutation record for n = 5. A permutation record of 
length n can be thought of as representing a permutation of 
n+1 objects as follows: the first object is placed down. The 
second object is placed to the left or right of the first 
object depending on whether i_1 is a 0 or a 1. This process is 
continued until the (n+1)st object is placed in the position 
indicated by i_n. 


For some applications it is convenient to speak of the "Ith 
permutation" of n+1 objects where I ranges from 0 to (n+1)!-1. 
The integer I can be related to a permutation record as 
follows: 


     I = i_1 + i_2 (2!) + i_3 (3!) + ... + i_n (n!)      (12.1) 


Such an I will be called the permutation number of the given 
record. The permutation record may be regarded as a 
representation in the factorial number system of the 
permutation number [Knuth, Vol. 2, 175 and Pager, 1970]. For 
example, let i_1 i_2 i_3 = 1 0 2. Then 

     I = 1 + 0(2!) + 2(3!) 
       = 1 + 0 + 12 = 13 


Thus every permutation record yields some permutation number. 
But is that number unique, or will two different records lead 
to the same number? We will show that not only is there a 
unique record for each number but that the record is easily 
reconstructed. First, note that 2 divides every term on the 
right hand side of (12.1) except the first so that 

     i_1 = REMDR(I,2) 


To determine the remaining n-1 elements of the permutation 
record, set I_1 = (I - i_1)/2 so that 


     I_1 = i_2 + i_3 (3!/2) + ... + i_n (n!/2) 


In this equation, each term is divisible by 3 except the first 
so that 


     i_2 = REMDR(I_1,3) 

This process of division and remaindering can be repeated 
until all coefficients have been obtained. Hence, given a 
number I, the permutation record can be deduced. 
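
The correspondence between records and numbers can be checked 
with a short sketch. It is written in Python rather than 
SNOBOL4 only so that it can be run directly; the function names 
permutation_number and permutation_record are assumptions made 
for this illustration. The first function evaluates (12.1) and 
the second applies the repeated division and remaindering just 
described. 

```python
# Illustrative sketch only: conversion between a permutation record
# [i_1, ..., i_n] and its permutation number I as defined by (12.1).
# Both function names are hypothetical.

def permutation_number(record):
    """Evaluate I = i_1 + i_2(2!) + ... + i_n(n!)."""
    total, factorial = 0, 1
    for k, digit in enumerate(record, start=1):
        if not 0 <= digit <= k:
            raise ValueError("i_k must satisfy 0 <= i_k <= k")
        factorial *= k                  # factorial now holds k!
        total += digit * factorial
    return total

def permutation_record(number, n):
    """Recover [i_1, ..., i_n] by dividing successively by 2, 3, ..., n+1."""
    record = []
    for radix in range(2, n + 2):
        number, digit = divmod(number, radix)
        record.append(digit)
    if number != 0:
        raise ValueError("number exceeds (n+1)! - 1")
    return record

print(permutation_number([1, 0, 2]))    # the example worked in the text
print(permutation_record(13, 3))
```

Running the two functions over every I from 0 to (n+1)!-1 for a 
small n confirms that each number has exactly one record. 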


Program 12.1  PERMUTATION 

PERMUTATION(S,I) will return the Ith permutation of the string 
S where I is a permutation number as defined above. If I is 0 
then the permutation is equal to S itself. If I >= N! where 
N = SIZE(S), then PERMUTATION will fail. Note that we can 
obtain all permutations of a given string in this way provided 
N!-1 does not exceed the maximum integer. On the IBM 360, with 
a maximum integer of 2**31 - 1, this amounts to the 
restriction that N <= 12. This seems rather severe and 
Exercise 12.11 suggests a remedy. Note that if one were 
cycling through each permutation of a set of objects one would 
be better advised to use a routine specially designed for that 
purpose (such as PERM, Program 12.2). 


* 
*  PERMUTATION(S,I) will return the Ith permutation of the 
*  string S. 
* 
          DEFINE('PERMUTATION(S,I)RADIX,T,S1,N') 
+                                                 :(PERMUTATION_END) 
* 
*  Entry point and top of loop: If I is 0 or drops to 0 as a 
*  result of repeated division, return the value remaining in 
*  S and the characters already accumulated in PERMUTATION. 
* 
PERMUTATION 
          PERMUTATION = EQ(I,0) PERMUTATION S         :S(RETURN) 
* 
*  Otherwise remove the next character of S (calling it T) 
*  and insert it into the position determined by the next 
*  value (N) of the permutation record. If no T could be 
*  found then fail because this means I was too big. 
* 
          S LEN(1) . T =                              :F(FRETURN) 
          RADIX = RADIX + 1 
          N = REMDR(I,RADIX) 
          PERMUTATION RTAB(N) . S1 = S1 T 
          I = I / RADIX                               :(PERMUTATION) 
PERMUTATION_END 
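
For readers who wish to experiment without a SNOBOL4 system, 
the same insertion scheme can be mimicked in Python. This is a 
sketch under stated assumptions, not a replacement for Program 
12.1: the function name permutation is hypothetical, failure 
(FRETURN) is represented by returning None, and the slice 
arithmetic stands in for the RTAB(N) match. 

```python
# Illustrative sketch only: a Python rendering of the insertion scheme
# used by PERMUTATION. Returning None stands in for FRETURN.

def permutation(s, i):
    """Return the i-th permutation of string s, or None if i is too big."""
    result = ""
    radix = 0
    while i != 0:
        if not s:
            return None                 # no character left: i >= len(s)!
        t, s = s[0], s[1:]              # remove the next character of s
        radix += 1
        n = i % radix                   # next value of the permutation record
        cut = len(result) - n           # insert t at n characters from the right
        result = result[:cut] + t + result[cut:]
        i //= radix
    return result + s                   # append whatever remains of s

for i in range(7):
    print(i, permutation("ABC", i))
```

As in the SNOBOL4 version, the first character is always 
consumed with a radix of 1, so its insertion position is 
necessarily 0. 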


Epilogue 


Characters are inserted one at a time into the string 
PERMUTATION in a position depending on the value of the per- 
mutation record. The value indicates a number of characters 
from the right because in this way a 0 permutation and only a 
0 will result in an identity operation. 


PERMUTATION is not well suited for arrays (as it stands) 
because insertion of an object into an array (while neighbors 
are moved apart) is not a natural operation. Instead of 
interpreting each element of the permutation record as an 
insertion point, each value can be regarded as an interchange 
distance, as follows. Interchange A<2> and A<1> according to 
the value of i_1. That is, interchange 


     A<2> and A<2-i_1> 


Then interchange A<3> with A<3-i_2>. Continue in this way until 
A<n+1> and A<n+1-i_n> are interchanged. 


Can all permutations be obtained in this way? By a bit of 
backward reasoning we can conclude that they can. From the 
position in the permuted array of the last element of the 
original array one can determine the value of i_n. Hence the 
scene as it existed prior to the last interchange can be 
reconstructed. Continuing in this way, the entire permutation 
record can be reconstructed. That means that every different 
permutation record gives rise to a different permutation. But 
there are (n+1)! permutation records and hence all permutations 
must be obtainable. 
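

As a sketch of the interchange scheme just described (in Python, so that it can be run directly; the name permute_by_interchange and its error handling are assumptions of this example), the digits of the permutation number are peeled off with radix 2, 3, ... and each digit is used as an interchange distance. Python lists are indexed from 0, so a[k] here corresponds to A<k+1> in the text.

```python
def permute_by_interchange(items, i):
    """Return a copy of items permuted according to permutation number i."""
    a = list(items)
    k = 0                 # 0-based index of the element about to be interchanged
    radix = 1
    while i != 0:
        k += 1
        radix += 1
        if k >= len(a):
            raise ValueError("permutation number too large")
        d = i % radix     # interchange distance: the next record value
        a[k], a[k - d] = a[k - d], a[k]
        i //= radix
    return a


if __name__ == "__main__":
    for i in range(6):
        print(i, permute_by_interchange("ABC", i))
```

A distance of 0 interchanges an element with itself, so permutation number 0 again leaves the array unchanged.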


Program 12.2  PERM 


Although the function PERMUTATION can yield a particular one 
of a class of permutations, it is not particularly well suited 
for cycling through all permutations of a given set of 
elements. This is because each permutation is generated 
freshly. It is more efficient to continually modify the last 
permutation to obtain the next. Trotter [1962] produced a 
scheme in which only one interchange per call was necessary to 
obtain each permutation. His method is basically as follows. 
Imagine the objects to be permuted to be arranged from left to 
right and numbered from 1 to n. Interchange objects 1 and 2 
to produce a new permutation. Then interchange objects 2 and 
3, 3 and 4, etc. In this way the object which had been on the 
left will swing in daisy chain fashion over to the right. When 
it reaches the right side it stops, the n-1 objects to its 
left are permuted once and, on subsequent calls, the last 
element is daisy-chained back from right to left. When it 
reaches the left, the other elements are again permuted and 
the process repeats. One needs a permutation record of sorts 
to record this movement and this is done as follows. I_1 
contains the position of the 1st element among the other (n-1) 
elements. I_2 holds the position of the 2nd element among the 
other (n-2) elements, etc. (A separate array can hold ±1 to 
denote direction of movement.) This system has the nice 
property that most permutations are done by a single test, 
increment, and interchange. The programming can be simplified 
by the use of recursion (not originally given by Trotter) 
without significantly adding to the time (see Exercise 12.12). 


PERM(A) uses Trotter's algorithm to cycle through every per- 
mutation of a singly dimensioned array with lower bound 1. The 
first time PERM is called the array is not modified but 
initialization is made. The initial value of A is regarded as 
the first permutation. On subsequent calls, the argument to 
PERM (presumably the same array) is permuted. Finally, when 
no more permutations remain, PERM will fail and reset itself 
to its initial state awaiting a new array. 


*
*  PERM(A) will permute the elements of the array A, failing
*  when no more permutations remain.  A is assumed to have at
*  least 2 elements.
*
         DEFINE('PERM(A)','PERM_INIT')              :(PERM_END)
*
*  PERM_INIT is the entry point on the first call to PERM.
*  First obtain the size of A (by converting prototype to
*  integer) and retain it for future reference in the global
*  variable SIZE_A.
*
PERM_INIT SIZE_A = +PROTOTYPE(A)
*
*  Set up arrays to indicate location and direction of
*  movement of elements.  Initialize location arrays to 1 because
*  every element starts in 1st position relative to remaining
*  members.  Initialize direction array to 1 to indicate
*  rightward movement.  -1 indicates leftward movement.
*
         LOC_ELEMENT = ARRAY('0:' SIZE_A - 2, 1)
         DIR_ELEMENT = ARRAY('0:' SIZE_A - 2, 1)
*
*  Redefine the entry point.  All outside calls will have one
*  argument so that I and OFFSET will initially have the
*  value null.  When PERM is called recursively I and OFFSET
*  are given different values.  I represents the item to be
*  permuted and OFFSET represents the extent to which the
*  subpermutation of elements I, I + 1, ..., N - 1 is offset
*  from the overall permutation.
*
         DEFINE('PERM(A,I,OFFSET)RL,D,LIMIT,AL')    :(RETURN)
*
*  Steady state entry point: Determine the relative location
*  (RL) of the Ith element in the subarray and the direction
*  (D) in which it is moving.  Also determine the LIMIT of
*  travel in this direction.  If the limit has been reached,
*  go to PERM_1.
*
PERM     RL = LOC_ELEMENT<I>                        :F(FRETURN)
         D = DIR_ELEMENT<I>
         LIMIT = EQ(D,1) SIZE_A - I
         LIMIT = EQ(D,-1) 1
         EQ(LIMIT,RL)                               :S(PERM_1)
*
*  Determine the absolute location (AL) of the Ith element,
*  swap elements, update location vector, and return.
*
         AL = RL + OFFSET
         SWAP(.A<AL>, .A<AL + D>)
         LOC_ELEMENT<I> = RL + D                    :(RETURN)
*
*  Reverse the direction of movement of the Ith element.
*  Determine the OFFSET of the subpermutation (it grows by 1
*  only when the Ith element has come to rest at the left)
*  and attempt to make the permutation; if success return;
*  otherwise, reset entry point and fail.
*
PERM_1   DIR_ELEMENT<I> = -D
         OFFSET = EQ(D,-1) OFFSET + 1
         PERM(A, I + 1, OFFSET)                     :S(RETURN)
PERM_F   DEFINE('PERM(A)','PERM_INIT')              :(FRETURN)
PERM_END


Names referenced     Name       Type        Where defined 
by PERM:             SWAP       Function    Program 3.14 
Epilogue 


The program is written recursively because this is the way the 
algorithm is described, and because the inefficiencies of 
recursion will not manifest themselves in substantially slower 
programs. A difficulty involved in specifying the function 
recursively was that the recursive call is to permute an array 
which does not exist in isolation but only as part of a larger 
array. Hence, we must give additional information such as the 
OFFSET of the start of the array with respect to the larger 
array and I, the level of the item to be moved. The OFFSET 
and level have been defined in such a way that the outer call 
should be made with these values equal to 0. Hence if the user 
ignores them, which he is instructed to do, and passes only one 
argument, the array, he will get the correct results. 
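

The bookkeeping of relative location, direction and offset can be illustrated by a short Python sketch. It is not a transcription of PERM: it uses a generator and a loop over levels in place of the recursive call, indexes from 0, and the name trotter is an assumption of this example. Each yielded arrangement differs from the previous one by a single interchange of neighbors.

```python
def trotter(items):
    """Yield every arrangement of items, one adjacent interchange at a time."""
    a = list(items)
    n = len(a)
    loc = [0] * max(n - 1, 0)        # relative position of the item at each level
    direction = [1] * max(n - 1, 0)  # +1 = moving right, -1 = moving left
    yield tuple(a)                   # the initial order counts as the first permutation
    while True:
        level, offset = 0, 0
        while level < n - 1:
            d = direction[level]
            # Last relative position reachable in the current direction.
            limit = n - 1 - level if d == 1 else 0
            if loc[level] != limit:
                al = loc[level] + offset            # absolute location
                a[al], a[al + d] = a[al + d], a[al]
                loc[level] += d
                break
            # End of travel: turn this item around and let the next level move.
            direction[level] = -d
            if d == -1:
                offset += 1          # item rests at the left, so the rest is shifted by one
            level += 1
        else:
            return                   # every level is at its limit: nothing remains
        yield tuple(a)


if __name__ == "__main__":
    for p in trotter("ABC"):
        print("".join(p))
```

Because the inner loop simply finds nothing to do when there are fewer than two items, the sketch also copes with arrays of length 0 or 1.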


Program 12.3  PERMS 


Although PERM can be modified to permute strings, we here seek 
an algorithm specifically intended for use with the string 
data type in hopes of obtaining something simpler if not more 
efficient. As we recall from Chapter 3, a permutation can be 
regarded as a positional transformation and hence can be 
programmed to run rapidly via the REPLACE function. Thus if 
P(S) is a permutation of the string S and if X is the first n 
characters from &ALPHABET where n is the size of S, then 


     REPLACE(P(X), X, S) 


will be equal to P(S). The difficulty, it would seem, is that 
in order to obtain P(S) we need construct the permutation 
first. But this difficulty can be surmounted by the following 
consideration. Let 


     S_1 = REPLACE(P(X), X, S) 
     S_2 = REPLACE(P(X), X, S_1) 
     S_3 = REPLACE(P(X), X, S_2) 


etc. Each consecutive permutation is obtained by permuting 
according to P the last previously obtained permutation. It 
is customary to denote the compounding of permutations in this 
way by product notation and the repeated application of the 
same permutation therefore is denoted by exponential notation 
as: 


     S_1 = P(S) 
     S_2 = PP(S) = P^2(S) 
     S_3 = P^3(S) 


etc. One interesting question is: does there exist a 
permutation P for which its various powers cycle through all 
the permutations? This question is answered by group theory. 
The set of permutations of n objects can be regarded as the 
elements of a group (of cardinality n!) where the group 
operation is the "multiplication" described above. The 
question becomes, is the permutation group of n elements 
cyclic? The answer is readily given as no (see, for example, 
Zassenhaus [1958]), but we can produce almost as good a result 
by obtaining a small set of basic permutations, from which we 
can produce all the others. 


In what follows we will speak of rotating the first k 
characters of a string one place or simply rotating the first 
k characters to mean the transformation: 


     S LEN(1) . C LEN(K - 1) . S1 = S1 C 


In words, the first k characters are picked up, rotated once 
to the left and set down again. Thus, rotating the first 3 
characters of 'ROTATE' yields 'OTRATE'. Rotating the first k 
characters of a string is a positional transformation and can 
be done at high speed provided appropriate REPLACE arguments 
have been set up in advance. Let R(k) denote the operation of 
rotating the first k characters of a string. Then R(n) will 
rotate all the characters, and R(1) will do nothing. All 
permutations of a string can be obtained by a suitable 
combination of R(i)'s as follows. 


To produce the first permutation apply R(n). To obtain the 
2nd apply R(n) again. Upon applying R(n) for the nth time, we 
will have produced the original string which of course we 
cannot return. At this point we apply R(n-1) and return the 
resulting string. On subsequent calls R(n) is applied until 
the nth time thereafter at which point R(n-1) is again 
applied. Upon n-1 repetitions of this sequence of events we 
will have returned to the starting point at which time we 
apply R(n-2). So the sequence continues until, at last, there 
emerges an attempt to apply R(1). R(1) is a 'no-op' and this 
is the signal that all permutations have been produced. A 
permutation record is used to record the number of 
applications of each type of rotation. 


The idea of obtaining the sequence of permutations by a 
suitable number of rotations was suggested by Peck and Schrack 
[1962] and suffered from the fact that Trotter's algorithm 
(which appeared later) produced a superior result for arrays. 
But in the case of strings, rotations can be programmed to be 
as efficient as interchanges. Since the computational backdrop 
is simpler for the Peck and Schrack algorithm we will use it 
to write PERMS. We have come full cycle on this one. 
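

The rotation scheme, together with the REPLACE trick, can be tried out in a few lines of Python. This is a sketch under stated assumptions: str.translate with a str.maketrans table stands in for REPLACE(P(X), X, S), the placeholder string x stands in for the first n characters of &ALPHABET, and the names rotate_prefix and perms are invented for the example.

```python
def rotate_prefix(s: str, k: int) -> str:
    """R(k): rotate the first k characters of s one place to the left."""
    return s[1:k] + s[:1] + s[k:]


def perms(s: str):
    """Yield every arrangement of the characters of s by prefix rotations."""
    n = len(s)
    x = "".join(map(chr, range(n)))            # n distinct placeholder characters
    # first_op[k] is R(k) applied once, in advance, to the placeholder string.
    first_op = {k: rotate_prefix(x, k) for k in range(2, n + 1)}
    count = {k: 0 for k in range(2, n + 1)}    # applications of each R(k)
    yield s
    while True:
        k = n
        while k >= 2:
            # Positional transformation: same effect as REPLACE(FIRST_OP<K>, X, S).
            s = first_op[k].translate(str.maketrans(x, s))
            count[k] += 1
            if count[k] % k != 0:
                break                          # a new string: hand it back
            k -= 1                             # full cycle of R(k): try R(k-1)
        else:
            return                             # reached R(1): all done
        yield s


if __name__ == "__main__":
    print(rotate_prefix("ROTATE", 3))
    print(list(perms("ABC")))
```

The translation table is rebuilt on every step only to keep the sketch short; the point of the technique is that the rotated placeholder strings are prepared once.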


*
*  PERMS(S) will permute the characters of the string S.  S
*  is assumed to be at least 2 characters long and no greater
*  than the size of &ALPHABET.  The argument S should be the
*  string which had been returned by PERMS on the last call.
*  When no more permutations remain, PERMS will fail.
*
         DEFINE('PERMS(S)T,N,C,K','PERMS_INIT')     :(PERMS_END)
*
*  Initialization entry point: N_R<I> will record the number
*  of applications of R(I).  FIRST_OP is an array such that
*  REPLACE(FIRST_OP<I>, SECOND_OP, S) will be equivalent to
*  applying R(I) to S.
*
PERMS_INIT
         N = SIZE(S)
         N_R = ARRAY('2:' N, 0)
         &ALPHABET LEN(N) . SECOND_OP               :F(ERROR)
         FIRST_OP = ARRAY('2:' N, SECOND_OP)
         K = N + 1
PERMS_I1 K = K - 1
         FIRST_OP<K> LEN(1) . S1 TAB(K) . S2 = S2 S1
+                                                   :S(PERMS_I1)
         DEFINE('PERMS(S)I,K')
         PERMS = S                                  :(RETURN)
*
*  Steady state entry point: Initialize K to the size of the
*  string.
*
PERMS    K = SIZE(S)
*
*  Apply R(K); failure implies that K=1 in which case we
*  branch to PERMS_2.
*
PERMS_1  S = REPLACE(FIRST_OP<K>, SECOND_OP, S)     :F(PERMS_2)
*
*  Bump N_R<K>; if this number equals 0 mod K we have come
*  full cycle; decrement K and repeat.  Otherwise return S.
*
         N_R<K> = N_R<K> + 1
         K = EQ(REMDR(N_R<K>, K), 0) K - 1          :S(PERMS_1)
         PERMS = S                                  :(RETURN)
*
*  If K is 1 no more permutations remain.  Fail but ready
*  PERMS for next set of permutations.
*
PERMS_2  DEFINE('PERMS(S)T,N,S1,S2','PERMS_INIT')   :(FRETURN)
PERMS_END


Program 12.4  REORDER 


We define a reordering of a string S as a permutation which 
produces a new string. For example, the string 'AAB' has 6 
permutations but only 3 are distinct (determined by the 
position of 'B') and so has only 3 reorderings. Reorderings 
are usually more significant than permutations in string 
processing where repeated elements are more common than in, 
say, arrays of numbers. 


REORDER(S,OS) will produce a reordering of the characters of 
the string S where OS is an ordered version of the string S. 
REORDER can be used to cycle through every different string 
composed of the characters of a given string, starting with 
the ordered string OS. It will FAIL when no more strings 
remain. Thus, using Program 3.1, ORDER, to order the string S 
we can print every reordering of S by the statements 


         OS = ORDER(S) 
         OUTPUT = OS 
LOOP     OUTPUT = REORDER(OUTPUT, OS)               :S(LOOP) 


Note that in the above, the previously generated string is 
used as the next input. 


It so happens that ORDER(S) will place the characters of S in 
alphabetic order. It is not necessary to be so strict. In 
fact, all that is necessary is that the ordered string contain 
like characters in adjacent positions. Thus if the string is 
'MISSISSIPPI', then 'SSSSIIIIPPM' will be a suitably ordered 
version. 


The number of reorderings of a string can be substantially 
less than the number of permutations. Let N be the length of 
a string S having n different characters. Let there be k_1 
instances of the first character, k_2 instances of the second, 
etc. Then the number of reorderings is 


     N! / (k_1! k_2! ... k_n!) 


For 'MISSISSIPPI' the number of reorderings is 


     11! / (4! 4! 2!) = 34650 


It would take about 48 pages to print all the reorderings of 
'MISSISSIPPI'. To print the permutations would require about 
50,000 pages. 
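

The count is easy to check mechanically. The following Python sketch (the function name count_reorderings is an assumption of this example) divides N! by the factorial of each character's number of occurrences.

```python
from collections import Counter
from math import factorial


def count_reorderings(s: str) -> int:
    """Number of distinct strings that use exactly the characters of s."""
    total = factorial(len(s))
    for k in Counter(s).values():
        # Instances of the same character are interchangeable.
        total //= factorial(k)
    return total


if __name__ == "__main__":
    print(count_reorderings("MISSISSIPPI"))
    print(factorial(len("MISSISSIPPI")) // count_reorderings("MISSISSIPPI"))
```

The second line printed is the ratio of permutations to reorderings, which is where the difference between the two page counts comes from.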


*
*  REORDER(S,OS) is used to produce the next permutation
*  (with repetitions) of the string S.  OS is an ordered
*  version of the string S.  It is called recursively.
*
         DEFINE('REORDER(S,ORDERED_S)C,FRONT,S1,LAST,D,OS')
+                                                   :(REORDER_END)
*
*  Entry point: Obtain in C the last character of ORDERED_S.
*  If no such character exists, S must be the null string.
*  Since this has no reordering, we fail.
*
REORDER  ORDERED_S RTAB(1) LEN(1) . C               :F(FRETURN)
*
*  Then work any character of type C toward the front of S.
*  First remove the characters of type C (if any) that
*  already are at the front of S.
*
         S (SPAN(C) | NULL) . FRONT =
*
*  Look for an interior C and interchange it with its
*  predecessor, grouping in with C all the characters
*  obtained previously in FRONT.  If an interior C cannot be
*  found, go to REORDER_1.
*
         S ARB . S1 LEN(1) . D C =                  :F(REORDER_1)
         REORDER = S1 FRONT C D S                   :(RETURN)
*
*  If all characters of type C have been worked toward the
*  front, control flows to REORDER_1.  Here we recursively
*  obtain a new sub-ordering and put all the characters of
*  type C on the back end.
*
REORDER_1 ORDERED_S BREAK(C) . OS
         REORDER = REORDER(S,OS) FRONT              :S(RETURN)F(FRETURN)
REORDER_END


Epilogue 


We normally make concessions to the aim of providing the 
simplest possible calling sequence, feeling that simplicity 
and convenience are two of the most desirable qualities that a 
program can have. Strictly speaking, the second argument to 
REORDER is unnecessary inasmuch as the second argument can be 
reconstructed unambiguously from the first. But in the 
interest of avoiding gross inefficiencies the second argument 
is made mandatory. 
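

To make the mechanics of REORDER easier to follow, here is a Python sketch that mirrors its steps one for one. The names reorder and all_reorderings are assumptions of this example, None stands in for failure, and sorted order is used as the ordered version although any ordering with like characters adjacent would do.

```python
def reorder(s, ordered):
    """Return the reordering that follows s, or None when none remains."""
    if not ordered:
        return None                      # the null string has no reordering
    c = ordered[-1]                      # last character of the ordered string
    rest = s.lstrip(c)                   # drop the C's already at the front ...
    front = s[:len(s) - len(rest)]       # ... and remember them
    pos = rest.find(c)                   # first interior C, if there is one
    if pos > 0:
        # Swap that C with its predecessor and park the front C's beside it.
        return rest[:pos - 1] + front + c + rest[pos - 1] + rest[pos + 1:]
    # Every C is at the front: advance the other characters recursively
    # and put all the C's on the back end.
    sub = reorder(rest, ordered[:ordered.index(c)])
    return None if sub is None else sub + front


def all_reorderings(s):
    """List every reordering of s, starting from the ordered string."""
    ordered = "".join(sorted(s))
    out, cur = [], ordered
    while cur is not None:
        out.append(cur)
        cur = reorder(cur, ordered)
    return out


if __name__ == "__main__":
    print(all_reorderings("AAB"))
    print(len(all_reorderings("MISSISSIPPI")))
```

As in the SNOBOL4 loop shown earlier, each string returned is fed back in as the next argument.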


Program 12.5  LPERM 


As we have stated earlier, some applications require 
permutations to be lexically ordered. This added restriction 
complicates the problem of permuting slightly; several 
solutions have been proposed. One by Shen [1963] has been 
found [Ord-Smith 1967] to be the "best and fastest" of a 
number of lexical permutation algorithms. It operates as 
follows. Obviously the first permutation is the string in 
lowest alphabetical order, i.e. the one produced by ORDER. The 
next permutation is obtained by interchanging the last 2 
characters. It is also clear that the last permutation will be 
the one in reversed lexical ordering as shown below: 


     ABCDEF 
     ABCDFE 
       ... 
     FEDCBA 


To obtain the next higher lexical ordering we find the 
smallest sized suffix that can be increased lexically. This is 
done by scanning from right to left looking for a character 
smaller than the previous character. This we call the pivotal 
character. All characters to its left must remain unchanged. 
The character moved in (from the right) to take the place of 
the pivotal character must be the next higher character to the 
right of the pivotal character. This is called the replacement 
character. All other characters in the suffix must be placed 
into the lowest lexical state. This is most easily done by 
interchanging the pivotal character with its replacement and 
reversing all characters other than the replacement. An 
example of this operation is shown in Figure 12.1. 


LPERM(S) will return the reordering of S next higher in 
lexical order. It uses the Shen algorithm modified for 
SNOBOL4. If no lexically greater permutation exists for S, 
LPERM will fail. To obtain all reorderings of a string the 
previously returned string must be passed as argument; the 
initial argument must equal ORDER(S). 


[Figure 12.1: An example illustrating the method used by LPERM 
to obtain the next permutation in lexical order, with the 
pivot and replacement characters marked.] 


*
*  LPERM(S) returns the next reordering in lexicographic
*  order of the string S.
*
         DEFINE('LPERM(S)P,T,X,R,Y,HIGHS')
*
*  Find the alphabetically highest character.
*
         &ALPHABET RTAB(1) LEN(1) . HIGH_CHAR       :(LPERM_END)
*
*  Entry point: Reverse the string to make scanning from the
*  back end easier.  Also place dummy character onto end so
*  that unevaluated expressions work.
*
LPERM    S = REVERSE(S) HIGH_CHAR
*
*  Look for pivot character (P).  If none can be found the
*  argument was in its highest lexical state.  We therefore
*  fail.
*
         S LEN(1) $ T LEN(1) $ P *LGT(T,P)          :F(FRETURN)
*
*  Search &ALPHABET for the set of all characters > P.  Call
*  them HIGHS.  Then search S for the replacement character
*  R.
*
         &ALPHABET BREAK(P) LEN(1) REM . HIGHS
         S BREAK(HIGHS) . X LEN(1) . R BREAK(P) . Y LEN(1)
+           = REVERSE(X P Y) R
*
*  Reverse the entire string back, remove the dummy character
*  and return.
*
         LPERM = REVERSE(S)
         LPERM HIGH_CHAR =                          :(RETURN)
LPERM_END


Names referenced     Name       Type        Where defined 
by LPERM:            REVERSE    Function    Program 3.6 
Epilogue 


The single most interesting part of LPERM, from the 
implementation point of view, is the search for the pivot 
element. Here a search is made for 2 consecutive characters 
such that the first is lexically greater than the second. This 
is done using dynamic assignment (the binary $ operator) and 
an unevaluated expression (*LGT(,)). To make this work under 
the normal quick-scan mode, a character had to be appended to 
S. This is because the scanner assumes that *LGT will match at 
least one character (which it does not) and would prematurely 
fail without testing if no more characters remained. The 
character appended (viz. HIGH_CHAR) was chosen in such a way 
that the algorithm will work whether or not the one-character 
assumption is made. 
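

The pivot-and-replacement rule can be stated compactly in Python. This sketch follows the prose description (find the pivot, exchange it with its replacement, put the rest of the suffix into lowest order) and not the reversed-string pattern matching of LPERM; the function name lperm and the use of None for failure are assumptions of the example.

```python
def lperm(s):
    """Return the reordering of s next higher in lexical order, or None."""
    chars = list(s)
    # Scan from the right for a character smaller than the one after it.
    i = len(chars) - 2
    while i >= 0 and chars[i] >= chars[i + 1]:
        i -= 1
    if i < 0:
        return None                      # s is already in its highest state
    # The replacement is the smallest character to the right that exceeds
    # the pivot; the suffix is non-increasing, so scan from the right end.
    j = len(chars) - 1
    while chars[j] <= chars[i]:
        j -= 1
    chars[i], chars[j] = chars[j], chars[i]
    # The suffix is still non-increasing: reversing it gives lowest order.
    chars[i + 1:] = reversed(chars[i + 1:])
    return "".join(chars)


if __name__ == "__main__":
    s = "AABB"
    while s is not None:
        print(s)
        s = lperm(s)
```

Because the comparisons treat equal characters as "not smaller", the loop above visits each distinct string once even when characters repeat.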


Program 12.6  IP 


A permutation vector is a sequence i_1 i_2 ... i_n containing 
one each of the numbers {1,2,...,n}. If P is a permutation 
vector (in the form of an array) then AI(A,P), where AI is 
Prog. 4.6, will return an array in which the elements of A 
have been permuted according to P. That is, the element in 
position P<i> will be moved to position i. Let 


     B = AI(A,P) 


If P is a permutation vector there must be another permutation 
vector Q such that A = AI(B,Q). Q is called the inverse of P. 
One description of Q is as follows 


     Q<j> = k if and only if P<k> = j 


This suggests that Q can be created as follows 


     Q = COPY(P) 
     SEQ(' Q<P<K>> = K', .K) 


(SEQ is defined in Prog. 4.3). For very large arrays we may 
find that it is necessary, or at least highly desirable, to 
invert the permutation vector in place and thus avoid the 
creation of additional storage. One way to do this is to 
recognize that every permutation consists of a sequence of 
cycles. Thus, the permutation vector (5,3,1,6,2,4,7) will have 
cycles as indicated in Figure 12.2. 


[Figure 12.2: the permutation vector (5,3,1,6,2,4,7) drawn as 
a row of boxes with arrows marking its cycles.] 


Figure 12.2 is drawn by directing an arrow from box i to box 
P<i>. For example P<1> is 5 so that an arrow is drawn from 
the first box to the fifth. A permutation vector has the 
property that each box will have exactly one such arrow direc- 
ted in and one directed out. From this it follows that each 
arrow will form part of a closed loop and that the entire 
graph is a collection of non-intersecting closed loops. Thus, 
permutations can be completely characterized by their loops. 
The vector of Figure 12.2, for example, can be described as: 


     (5,2,3,1) (6,4) (7) 


The inverse permutation can be obtained by reversing all ar- 
rows. This is most conveniently done by reversing all the 
arrows in a given loop much in the manner used to reverse a 
list (REVL, Prog. 5.3). When elements in a given loop are 
reversed they are made negative to indicate their reversal. 


*
*  IP(P) will invert a permutation vector contained in the
*  array P.  No additional storage is consumed.
*
         DEFINE('IP(P)M,PM,K,PK,PPK')               :(IP_END)
*
*  Entry point and outer loop: Bump M by 1 looking for a
*  non-negative value in P<M>.  Such a value indicates the start
*  of a cycle.  Array elements already inverted are denoted
*  by negative values.  When M runs out, we are done.
*
IP       M = M + 1
         IP = ¬P<M> P                               :S(RETURN)
         P<M> = LT(P<M>,0) -P<M>                    :S(IP)
*
*  If P<M> = M then we have a trivial cycle.  Go back.
*  Otherwise, we let K sequence through the cycle starting at M.
*
         EQ(P<M>,M)                                 :S(IP)
         K = M ; PK = P<M>
*
*  Go through loop setting P<P<K>> = -K.  Care must be taken
*  to save the value of P<P<K>> before it is overwritten.  The
*  loop terminates when we arrive back at M.
*
IP_LOOP  PPK = P<PK>
         P<PK> = -K
         K = PK
         PK = PPK
         EQ(PK,M)                                   :F(IP_LOOP)
         P<PK> = K                                  :(IP)
IP_END
Epilogue 


IP has been adapted for SNOBOL4 from an algorithm by Medlock 
[1965] and Boonstra [1965]. See also Knuth [Vol.1, 175] for 
another inverse permutation algorithm. 
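

The sign-marking technique can be illustrated with a Python sketch. The vector holds the numbers 1 to n as in the text, stored in an ordinary Python list, so position m of the text is index m - 1 here; the function name invert_permutation is an assumption of this example.

```python
def invert_permutation(p):
    """Invert, in place, a permutation vector holding the numbers 1..n."""
    n = len(p)
    for m in range(1, n + 1):
        if p[m - 1] < 0:
            # Already handled as part of an earlier cycle: restore the sign.
            p[m - 1] = -p[m - 1]
            continue
        if p[m - 1] == m:
            continue                     # trivial cycle
        # Walk the cycle that starts at m, reversing each arrow.
        k, pk = m, p[m - 1]
        while True:
            ppk = p[pk - 1]              # save before it is overwritten
            p[pk - 1] = -k               # negative marks "already inverted"
            k, pk = pk, ppk
            if pk == m:
                break
        p[pk - 1] = k                    # close the cycle at its starting point
    return p


if __name__ == "__main__":
    print(invert_permutation([5, 3, 1, 6, 2, 4, 7]))
```

Applying the function twice returns the original vector, which gives a quick check on any implementation.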


Exercises for chapter 12 


Exercise 12.1  Give the permutation numbers for the records 
below (provided they are valid permutation records). 


     a) (0 1 2 1) 
     b) (1 2 1 0) 
     c) (0 1 2 3) 
     d) (1 3 2 4) 
     e) (0 0 0 1) 


Exercise 12.2  Compute the permutation record of the following 
permutation numbers: (a) 6, (b) 3, (c) 13, (d) 26. 


Exercise 12.3  Write a SNOBOL4 program to convert a 
permutation record in V to a permutation number I. Assume the 
record is a string containing numbers separated by commas as 
in '1,2,1,3,'. 


Exercise 12.4  Define the sum of 2 permutation records as the 
permutation record of the sum of the associated permutation 
numbers. Write a SNOBOL4 program to determine the sum of 2 
such records. Assume the records are in the form indicated by 
the previous exercise. 


Exercise 12.5  Prove that the permutation number of 
(1,2,3,...,n-1) is n!-1. 


Exercise 12.6  The permutation number can alternatively be 
defined as 


     I = i_1 (n!/1!) + i_2 (n!/2!) + ... + i_n (n!/n!) 


Devise an algorithm to extract the record given I. 


Exercise 12.7  On the first time through the loop of 
PERMUTATION what will be the values assigned to RADIX, N, S1 
and I? 


Exercise 12.8  What is the associated permutation record of I 
and what value is returned by PERMUTATION('ABC', I) as I 
ranges from 0 through 5? 


Exercise 12.9  Let S be a string of 6 characters. Obtain the 
reverse of S by a call to PERMUTATION. 


Exercise 12.10  Rewrite PERMUTATION to operate on arrays. 


Exercise 12.11  In the call to PERMUTATION, one may escape the 
problem of limited arithmetic precision by denoting the 
permutation number as one long string as in 


     PERMUTATION(S, '32564117246785') 


Assuming that the length of a string is no greater than the 
largest integer, what statements within PERMUTATION would have 
to be modified to permit these extended integers? Modify them! 


Exercise 12.12  Let C(n) be the average number of calls to 
PERM (both external and internal) per permutation of an array 
of n elements. For example, if PERM were non-recursive, C(n) 
would be 1. 


(a) Write an expression for C(n) in terms of C(n-1). 


(b) Assuming that C(1) = 1, use (a) to compute C(2), C(3) and 
C(4). 


(c) Prove that if C(n) < C(n-1) then C(n+1) < C(n). 


(d) On the basis of (a), (b) and (c) what value does C(n) 
approach as n approaches infinity? 


(e) What conclusions can you draw with respect to the use of 
recursion to program PERM? 


Exercise 12.13  PERM can be extended to handle the special 
case of arrays of length 1 by the insertion of a single 
instruction. What is the instruction and where should it be 
placed? 


Exercise 12.14  What error in PERM will arise if its argument 
is an array with only one element? 


Exercise 12.15  PERM may be modified to permute a global 
string (say G_S) rather than an array by changing only two 
statements (in addition to perhaps adding temporary 
variables). What are they? Suggest modifications. 


Exercise 12.16  Modify PERMS so that if it is called with the 
null string it will be reset. 


Exercise 12.17  In using PERMS to permute the string 'LEMON', 
let us denote 'LEMON' itself as the 0th permutation. The next 
value returned is called the first permutation, etc. What 
number permutation is (a) 'MELON' and (b) 'EMLON'? 


Exercise 12.18  Give the smallest sequence of k-rotations 
(denoted R(k)) to permute the characters 'LEMON' to 'MELON'. 


Exercise 12.19  How can REORDER be modified so that it 
requires only one argument? Assume that the first string given 
is in alphabetic order (as returned from the ORDER function). 


Exercise 12.20  Write a function REORDERING(S,I) which will 
return the Ith reordering of the string S. That is, 
REORDERING(S,0) will return ORDER(S), etc. Pattern the 
function after PERMUTATION(S,I). Do not merely call REORDER I 
times as this would be grossly inefficient. Hint: the number 
of ways of interspersing k identical characters into the n+1 
positions of a string of length n is given by the binomial 
coefficient: 


     C(n+k, k) = (n+k)! / (n! k!) 


Exercise 12.21  Will the function LPERM (Prog. 12.5) produce 
all permutations or all reorderings of a string with repeated 
characters? Why? 


Exercise 12.22  Permutation vectors may be regarded as 
elements of a group under what operation? 


Exercise 12.23  Let I be the identity permutation of n 
elements. That is, I = {1, 2, ..., n}. Let P be an arbitrary 
permutation vector and Q be its inverse. What is the value of 
(a) AI(P,I), (b) AI(I,P), (c) IP(I), and (d) AI(P,Q)? 


CHAPTER THIRTEEN 


SORTING 


CONTENTS 


     BSORT ...................... 13.1 
     HSORT ...................... 13.2 
     ESORT ...................... 13.3 
     MSORT ...................... 13.4 
     FRSORT ..................... 13.5 
     TSORT ...................... 13.6 
     SSORT ...................... 13.7 
     INSERT ..................... 13.8 
     LINEARIZE .................. 13.9 
     INSERTB .................... 13.10 


Sorting on a digital computer covers a wealth of applications, 
can involve a variety of data structures and devices, and has 
been met with a host of techniques. Sorting has been widely 
used in business applications where payrolls, accounts, 
inventories and lists of all kinds must be sorted by name, 
number, address, etc. But, in addition, many other data 
processing applications find a need for sorting. Examples 
include compiler writing, where symbols are sorted in 
alphabetic order; computational linguistics, where 
dictionaries, indexes and concordances are prepared; and 
systems programming, where libraries are alphabetized for 
rapid searching. When the items to be sorted can fit entirely 
in core storage, the process is called internal sorting. When 
secondary storage is required, it is called external sorting. 
This chapter is concerned with internal sorting methods only. 
External sorting is generally only done when the amount of 
data to be sorted is large. Under these circumstances, SNOBOL4 
is not the ideal language for efficiency reasons. 


The aggregate of things to be sorted internally may be an 
array, a list, a string, a tree or a table. The ordering may 
be on the basis of numerical value, lexicographic value or
number of occurrences and the ordering may be forward or 
reverse. A routine may be required to actually sort an array 
or merely return an array of indices that could then be ap- 
plied to one or more arrays. For these reasons and others to 
follow there is no one universal sort routine. Rather, each 
situation tends to be special and tends to require a sort 
tailored for the application. 


The distribution of the input items may not be very uniform. 
There may, in fact, be strong correlations present in the
to-be-sorted aggregate which, if taken into account, could
improve the sorting time. Not all algorithms are equally adept
at taking advantage of an almost-ordered input array. With 
some algorithms, almost-ordered data can actually adversely 
affect sorting time. 


Another factor associated with the distribution which can in- 
fluence the choice of sorting algorithm is the degree to which 
there is repetition in the data to be sorted. For example, in 
the preparation of a book index or a word concordance, the 
number of repeated items is high. There are sorting techniques 
which work quite well in such circumstances and their use can 
reduce sorting times substantially for this kind of problem.


The sorting situation is somewhat influenced by the nature and 
amount of so-called passive information which must undergo the 
same permutation as the input array, but which does not
participate in the determination of the new order. For example,
if we are sorting the payroll by location we presumably want
to bring along with the location other passive information
such as name, payroll number, salary, etc. Such ancillary
information may take many forms. The passive information may
appear in a separate array. Or the active information may be
embedded in the passive information as for example when card- 
image strings are to be sorted on the basis of certain 
columns. Or the passive and active information may appear as 
fields of programmer-defined data objects. The way in which a 
sorting method handles equal items may be crucial in certain 
applications where passive information is present. 


The reason that sorting is done at all is usually to 
facilitate later lookup by either man or machine. Imagine the 
difficulty one would have if all the names in the telephone 
book were scrambled chaotically. To search the telephone book 
for an entry we would have to make what is called a linear 
search comparing each name one after the other until the 
desired entry was found. The time required would be, on the 
average, the time to make n/2 comparisons, where n is the num- 
ber of items in the book. On the other hand, if the book is 
alphabetized we can do a so-called binary search. We can look 
at the middle item and decide whether the desired name occurs 
after or before this middle item. Regardless of the outcome 
of this initial test, we can again probe the middle element in 
the segment known to contain the name and, in such a way,
narrow the search by half at each comparison. The number of
comparisons in this latter case is log2 n. When n is large the
difference between log2 n and n/2 is truly impressive. For n
equal to 10000, log2 n is only 13 whereas n/2 is 5000.
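
The contrast can be checked with a short sketch. The fragment
below is written in Python purely for illustration (it is not
one of the SNOBOL4 programs of this book, and the names
linear_probes and binary_probes are hypothetical). It counts
the entries examined by each kind of search in a sorted "book"
of 10000 names.

```python
def linear_probes(book, name):
    """Return the number of entries examined by a linear search."""
    for probes, entry in enumerate(book, start=1):
        if entry == name:
            return probes
    return len(book)


def binary_probes(book, name):
    """Return the number of middle entries examined by a binary search.

    Assumes book is sorted in ascending order.
    """
    lo, hi = 0, len(book) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if book[mid] == name:
            break
        if book[mid] < name:
            lo = mid + 1        # the name lies after the middle item
        else:
            hi = mid - 1        # the name lies before the middle item
    return probes


book = ["%05d" % i for i in range(10000)]   # an alphabetized "telephone book"
middle_name_linear = linear_probes(book, book[4999])
worst_binary = max(binary_probes(book, name) for name in book)
```

Finding the name in the middle of the book takes 5000 probes
linearly, whereas no name needs more than 14 probes by binary
search, in line with log2 10000 being a little over 13.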


An appreciation of the difference between a quantity which 
grows linearly (such as n/2) and a quantity which grows 
logarithmically is needed to understand the significance of 
some sorting methods and some formulas expressing their com- 
putational requirements. To further underscore the distinction 
between linear and logarithmic growth, the latter quantity 
grows only as fast as the number of digits needed to express 
the former. Thus log2 n not merely grows more slowly than n
but becomes extremely sluggish as n grows large.


As we have outlined here, there is a rich variety in the kinds 
of sorts that one might be called upon to make. We will not 
try to give a complete and exhaustive set of programs which 
could handle every conceivable situation. We will, rather,
present a few general methods, and give a few specific
examples and hope that either these, or suitable modifications of
them, will serve any given sorting need. 


More complete sources of information on sorting are available. 
Flores [1969] and Knuth [Vol. 3] have written books on the 
subject. An entire CACM issue has been devoted to sorting 
[Sorting Issue, 1963]. An excellent early summary of sorting
techniques is given by Friend [1956]. A recent bibliography
is given in Lorin [1971].


Sorting methods generally subdivide into two categories,
internal and external. The internal sorts are subdivided again
into two categories, comparison sorts and distributive sorts.
Generally speaking, comparison sorts sort on the basis of
pairwise comparisons between elements. Distributive sorts are
anything else.


COMPARISON SORTS


A comparison sort works by successively comparing pairs of
items to be sorted. The values of the items are irrelevant
other than as to how they compare with each other. Thus, a
comparison sort will operate in precisely the same way if one
is sorting strings or numerical values. Indeed, a comparison
sort can be used effectively to sort data objects of any kind
provided an operation can be written which compares the two
items.


Before considering the various methods of sorting it will be 
well to obtain some idea of the basic computational neces- 
sities involved in a comparison sort. If we assume that every 
permutation of the input array is equally likely, then we can 
use an information-theory argument to determine a lower bound 
on the average number of comparisons needed. There are n! ways 
of permuting n objects. Therefore the input array (of length
n) can be thought of as encoding a message containing log2 n!
bits. Since one comparison yields one bit of information and
since in order to sort we need complete information concerning
the permutation, we may loosely conclude that at least log2 n!
comparisons are needed on the average. Using Stirling's
approximation formula [Knuth, Vol.1, p.46] we obtain


$$
\begin{aligned}
\log_2 n! &\approx \log_2\left((2\pi)^{.5}\, n^{n+.5}\, e^{-n}\right) \\
          &= 1.33 + n \log_2 n + .5 \log_2 n - 1.43\,n \\
          &\approx n\,(\log_2 n - 1.43)
\end{aligned}
$$


Moreover, for large n (say n > 1000)

$$ \log_2 n! \approx n \log_2 n $$
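
These approximations are easy to tabulate. The following is a
minimal sketch in Python (used here only for illustration; the
function names are hypothetical) which computes log2 n! from
the log-gamma function and prints it beside Stirling's formula
and the two cruder estimates.

```python
import math


def log2_factorial(n):
    """log2(n!) computed from the log-gamma function."""
    return math.lgamma(n + 1) / math.log(2)


def stirling_log2_factorial(n):
    """log2 of Stirling's approximation (2*pi)**.5 * n**(n + .5) * e**(-n)."""
    return (0.5 * math.log2(2 * math.pi)
            + (n + 0.5) * math.log2(n)
            - n * math.log2(math.e))


for n in (10, 100, 1000, 10000):
    print(n,
          round(log2_factorial(n), 1),
          round(stirling_log2_factorial(n), 1),
          round(n * (math.log2(n) - 1.43), 1),
          round(n * math.log2(n), 1))
```

The last column drops the linear term altogether, so it
overestimates log2 n! by roughly 1.4 n; it is adequate only
when that term is small beside n log2 n.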


The information theory argument may be made rigorous by the 
following line of reasoning. Suppose we wanted to communicate 
to a distant location the contents of a permutation vector P. 
If P has n elements and if all permutations are equally likely
then this will require log2 n! bits (on the average). That this
is true is intuitively plausible. For a more general and
rigorous treatment of the subject consult any textbook on
information theory. For example, see Reza [1961], p.148. This
granted, assume that we have a comparison sorting algorithm
(Algorithm S) which uses a predicate COMPARE(X,Y) to obtain
information about the array it is sorting. But no other
information about the value of the elements of the array is
available to S. If we allow Algorithm S to sort P it will
transform P into I, the identity permutation vector 1,2,...,n.
Now at a distant location set up Algorithm S to sort the
elements of I using the comparison bits tapped from the sorting
of P. This setup is shown in Figure 13.1. The result of this
is that I is transformed into the inverse of P so that we have
effectively transmitted P. Since the information transmitted
must be at least log2 n! bits on the average we know that we
must have at least log2 n! comparisons on the average.


[Diagram: Algorithm S sorting P at one site, with the results
of its COMPARE(X,Y) predicate sent over a communication link to
a second copy of Algorithm S sorting I.]


Figure 13.1


An information theoretic argument for showing that
sorting requires log2 n! comparisons.
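

The transmission scheme can be acted out in a few lines. The
sketch below is in Python, for illustration only; sort_with and
transmit are hypothetical names, and a simple insertion method
stands in for Algorithm S (any comparison sort whose moves
depend only on the answers of COMPARE would serve). The sender
sorts P and taps each comparison result; the receiver sorts I
while ignoring the values and replaying the tapped bits.

```python
def sort_with(items, compare):
    """Sort a copy of items using only the predicate compare(x, y).

    compare(x, y) answers whether x should come after y.
    """
    a = list(items)
    for j in range(1, len(a)):
        v = a[j]
        k = j - 1
        while k >= 0 and compare(a[k], v):
            a[k + 1] = a[k]
            k -= 1
        a[k + 1] = v
    return a


def transmit(p):
    """Send permutation vector p as comparison bits; return (bits, received)."""
    bits = []

    def sender_compare(x, y):
        bits.append(x > y)              # tap each comparison result
        return bits[-1]

    sort_with(p, sender_compare)        # the sender sorts P into I

    replay = iter(bits)

    def receiver_compare(x, y):
        return next(replay)             # values are ignored; only bits are used

    identity = list(range(1, len(p) + 1))
    return bits, sort_with(identity, receiver_compare)


p = [3, 1, 2, 5, 4]
bits, received = transmit(p)
inverse = [p.index(v) + 1 for v in range(1, len(p) + 1)]
```

Here received comes out equal to inverse, the inverse of P, so
the bits alone were enough to reconstruct the permutation.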


It is important to understand what the formula says. It does 
not say that we must necessarily make this many comparisons in
any given instance. We must, rather, make this many
comparisons on the average if the permutations are equally
likely. From this observation we can deduce that if the number
of comparisons which are to be made is independent of the
distribution and only dependent on n (the number of items)
then the method must make at least log2 n! comparisons if it
is to work for all possible distributions.


There are four principal kinds of comparison sorts: 


Interchange 
Merging 
Selection 
Insertion


INTERCHANGE SORTING


Given an array, the elements of the array can be pair-wise
interchanged until the elements are sorted. This has the
advantage that no additional storage need be allocated.
Moreover no other sort type has this property. But every
interchange sort has some flaw which makes it unacceptable
for some applications.


Program 13.1  BSORT


The simplest kind of interchange sort which is of any interest
is the so-called bubble sort. In the bubble sort the first and
second items are compared; if they are out of order they are
interchanged. This sorts the first 2 items. To sort the first
K items assuming the first K-1 items are sorted we "bubble"
the Kth item down through the sorted list of K-1 items
searching for its correct insertion point. This takes an
average of approx. K/2 comparisons to insert the Kth item and
approximately N(N/4) comparisons to sort N items. This is
really too many, yet the popularity of the bubble sort
persists. This is due to several factors. The bubble sort is
easy to program and understand. Also for small N the figure
N(N/4) is not much greater than N log2 N. Hence the bubble
sort is reasonably fast for N = 25 or so. But as the number
of items increases the bubble sort departs severely from the
ideal. At N = 100, the bubble sort requires 4 times as many
comparisons. For N = 1000 the ratio is 25.


Sorting routines, like the bubble sort, whose comparisons are
dominated by the factor N^2 are called quadratic. Sorting
algorithms which obey an N log2 N law or differ by a
proportionality constant are called logarithmic. Though
inefficient for large N, a quadratic sort can be more efficient
than a logarithmic sort for small values of N (less than 10 or
so). For this reason a logarithmic sort may use a quadratic
sort as a utility routine for the purpose of handling small
arrays.


For medium values of N the bubble sort can save time if the 
array is almost sorted to begin with. The bubble sort, more 
than most, takes advantage of any pre-existing order in the 
array.


*  BSORT(A,I,N) will sort (via a Bubble sort) in ascending
*  lexical order the strings in the subarray A<I>, A<I + 1>,
*  ..., A<N>. CAUTION: Bubble sorts may be time consuming
*  for large arrays.
*
          DEFINE('BSORT(A,I,N)J,K,V')            :(BSORT_END)
*
*  Entry point: J will hold the index of the item to be
*  bubbled.
*
BSORT     J = I
*
*  Outer loop: Loop on J. V is the value of the bubble.
*
BSORT_1   J = J + 1 LT(J,N)                      :F(RETURN)
          K = J
          V = A<J>
*
*  Inner loop: Loop on K. We bubble down into the lower
*  portion of the array looking for a place to insert V.
*
BSORT_2   K = K - 1 GT(K,I)                      :F(BSORT_RO)
          A<K + 1> = LGT(A<K>,V) A<K>            :S(BSORT_2)
          A<K + 1> = V                           :(BSORT_1)
*
*  On runout, plunk bubble into bottom and go back to outer
*  loop.
*
BSORT_RO  A<I> = V                               :(BSORT_1)
BSORT_END
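
The comparison counts quoted for the bubble sort can be
observed directly. What follows is a rough Python rendering of
the same bubbling scheme (an illustration only, not the SNOBOL4
program; the name bsort and the comparison counter are
additions) which returns the number of comparisons it made.

```python
import random


def bsort(a, i, n):
    """Sort a[i..n] (inclusive bounds) in place; return the comparison count.

    Each new item is bubbled down through the already sorted lower portion.
    """
    comparisons = 0
    for j in range(i + 1, n + 1):
        v = a[j]                    # the bubble
        k = j - 1
        while k >= i:
            comparisons += 1
            if a[k] > v:
                a[k + 1] = a[k]     # shift the larger item up
                k -= 1
            else:
                break
        a[k + 1] = v                # insertion point found (or runout)
    return comparisons


random.seed(13)
n = 200
items = [random.random() for _ in range(n)]
count = bsort(items, 0, n - 1)          # random input: about n * n / 4
presorted_count = bsort(list(range(10)), 0, 9)
```

For the random input the count lands near N(N/4) = 10000,
while the presorted array of 10 items costs only 9 comparisons,
one per item bubbled.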

Program 13.2  HSORT


An interchange sort which is logarithmic rather than quadratic
is one introduced by Hoare [1961] and improved by Hoare [1962]
and Scowen [1965]. It is frequently called QUICKSORT. The
basic idea is to interchange the elements of the array until
they are partitioned into two groups, A and B, such that


(i) Each element in group A lies lower (i.e. has lower index)
than every element in group B.


(ii) Every element in group A <= every element in group B.


Note that A and B need not be equal in size. If groups A and 
B are then sorted separately the entire array will be sorted. 
The sort routine therefore consists of partitioning the array 
followed by two recursive calls to sort the partitions. 


One method of partitioning is to pick the middle element and 
use this as a criterion to separate the lows from the highs. 
The elements of lower index are examined one by one for an 
element that is >= this criterion. The elements of higher index
are searched from the top down to determine if any are <= this
criterion. When found the elements are interchanged and the 
search goes on. Eventually the two pointers cross at which 
point the partitioning is completed. 
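
The partitioning step just described can be sketched as
follows. This is an illustrative Python fragment rather than
the SNOBOL4 routine; the function name partition is
hypothetical and subscripts start at 0 instead of 1.

```python
def partition(a, i, n):
    """Partition a[i..n] (inclusive) around its middle element.

    Returns k such that every element of a[i..k] is <= every element
    of a[k+1..n]. Assumes the segment holds at least two items.
    """
    criterion = a[(i + n) // 2]
    j, k = i - 1, n + 1
    while True:
        j += 1
        while a[j] < criterion:     # look upward for an element >= criterion
            j += 1
        k -= 1
        while a[k] > criterion:     # look downward for an element <= criterion
            k -= 1
        if j >= k:                  # the two pointers have met or crossed
            return k
        a[j], a[k] = a[k], a[j]     # interchange and go on searching


data = [5, 3, 8, 1, 9, 2, 7]
k = partition(data, 0, len(data) - 1)
```

After the call, data[0..k] and data[k+1..] form the two groups;
sorting each separately sorts the whole array.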


For each partition there are approximately n comparisons where 
n is the size of the array to be partitioned. Hence the number 
of comparisons is n times the average depth of the recursion. 
Ideally this is log2 n. Hence, ideally the number of
comparisons approaches n log2 n. But this ideal is reached only
if the criterion is always chosen so that it partitions the
array in half. For randomly chosen criterion the figure for
the number of comparisons is approximately 1.4 n log2 n [Hoare
1962]. This factor of 1.4 also shows up in the analysis of
one of the insertion sorts. (See Exercise 13.13).


HSORT is not particularly fast for arrays with a small number 
of items. Ideally, when the array is small, BSORT should be 
called. This is explored in an exercise. 


The algorithm given here differs somewhat from Hoare [1961] 
and is such as to reduce the size of the program at the ex- 
pense of a small increase in running time. 


*  HSORT(A,I,N) will sort the strings in array A<I>, A<I +
*  1>, ..., A<N> in ascending sequence. HSORT calls itself
*  recursively.
*
            DEFINE('HSORT(A,I,N)J,K,CRITERION')    :(HSORT_END)
*
*  Entry point: If more than 2 items remain skip. If only 1
*  item is to be sorted, just return.
*
HSORT       GT(N - I,1)                            :S(HSORT_LARGE)
            GE(I,N)                                :S(RETURN)
            (LGT(A<I>,A<N>) SWAP(.A<I>,.A<N>))     :(RETURN)
*
*  Obtain CRITERION to be used for partitioning array into 2
*  groups.
*
HSORT_LARGE
            CRITERION = A<(I + N) / 2>
*
*  J will move through the array from the bottom looking for
*  an element >= CRITERION. K will move through the array from
*  the top looking for an element <= CRITERION.
*
            J = I - 1
            K = N + 1
HSORT_UP    J = J + 1
            ~LGT(CRITERION,A<J>)                   :F(HSORT_UP)
HSORT_DOWN  K = K - 1
            ~LGT(A<K>,CRITERION)                   :F(HSORT_DOWN)
*
*  If J is still < K, interchange and go back.
*
            (LT(J,K) SWAP(.A<J>,.A<K>))            :S(HSORT_UP)
*
*  Otherwise, we are done partitioning the elements. K will
*  serve as a convenient dividing line. Sorting will be
*  accomplished by sorting the 2 subarrays. Might as well use
*  HSORT to do this.
*
            HSORT(A,I,K)
            HSORT(A,K + 1,N)                       :(RETURN)
HSORT_END


Names referenced by HSORT:

    Name        Type        Where defined
    SWAP        Function    Program 3.14


Epilogue


A difficulty with the Hoare sort is the possibility that equal 
items will not retain their relative order. In the subroutine 
given, this makes no difference since such an inversion will 
be undetectable by the user. But in sorting structures, for 
example, this property could prove to be a critical defect. 
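
A small case makes the defect visible. The sketch below is an
illustrative Python rendering of the same partitioning scheme
applied to records (it is not the SNOBOL4 program; the key
argument and the sample payroll records are assumptions made
for the example). Only the location takes part in comparisons.

```python
def hsort(a, i, n, key):
    """Sort a[i..n] (inclusive) in place by key(item), in the manner of HSORT."""
    if n - i <= 1:
        if i < n and key(a[i]) > key(a[n]):
            a[i], a[n] = a[n], a[i]
        return
    criterion = key(a[(i + n) // 2])
    j, k = i - 1, n + 1
    while True:
        j += 1
        while key(a[j]) < criterion:
            j += 1
        k -= 1
        while key(a[k]) > criterion:
            k -= 1
        if j >= k:
            break
        a[j], a[k] = a[k], a[j]
    hsort(a, i, k, key)
    hsort(a, k + 1, n, key)


# Records are (location, name); the name is passive information.
payroll = [("Boston", "Adams"), ("Boston", "Baker"), ("Albany", "Clark")]
hsort(payroll, 0, len(payroll) - 1, key=lambda record: record[0])
```

The result is sorted by location, but Baker now precedes Adams
although both are in Boston and Adams came first in the input.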


MERGING


Merging is not strictly a sorting technique. It is a technique
whereby two sorted aggregates can be combined into one sorted
aggregate by the simple process of selecting and incrementing
the aggregate showing the current least value. But, merging
may be converted into a sorting technique in the following
way. Let the final sorted aggregate of length n be the result
of merging two sorted aggregates of length n/2. Let each of
these be the result of merging two aggregates of length n/4,
etc. Ultimately we reach a point at which the aggregates have
length 1 and can be regarded as being sorted. The merge sort
is quite efficient and approaches the theoretical lower limit
on the number of comparisons needed.
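
In outline, and purely as an illustration in Python (the names
merge and merge_sort are hypothetical, and fresh lists are
allocated freely, which the programs of this chapter avoid),
the idea reads:

```python
def merge(x, y):
    """Merge two ascending lists into one ascending list."""
    result = []
    i = j = 0
    while i < len(x) and j < len(y):
        if y[j] < x[i]:             # take from y only when strictly smaller
            result.append(y[j])
            j += 1
        else:
            result.append(x[i])
            i += 1
    result.extend(x[i:])            # on runout append whatever remains
    result.extend(y[j:])
    return result


def merge_sort(items):
    """Sort by merging two sorted halves; a list of length <= 1 is sorted."""
    if len(items) <= 1:
        return list(items)
    middle = len(items) // 2
    return merge(merge_sort(items[:middle]), merge_sort(items[middle:]))


merged = merge([1, 4, 9], [2, 3, 10])
letters = merge_sort(list("MISSISSIPPI"))
```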


Program 13.3  LSORT


The aggregate merged in the merge sort can be any collection
of information accessible in serial fashion and hence it is a
favorite way of sorting such serial aggregates as files and
lists. LSORT will sort a linked list in ascending sequence
according to the value contained in the VALUE field. If HEAD
is the head of the linked list then LSORT(HEAD) will sort the
list and return the new head. LSORT does not allocate new
storage; it just rearranges pointers.


*  LSORT will sort a linked list L using a merge sort. The
*  caller may specify the name of the value field, the next
*  field and the predicate. Default names are VALUE, NEXT
*  and LGT.
*
            DEFINE('LSORT(L,VFLD,NFLD,PRED)L1,L2,PTR')
*
*  LSORT uses the auxiliary function LSORTA which is called
*  recursively.
*
            DEFINE('LSORTA(N)I')                   :(LSORT_END)
*
*  Entry point for LSORT: Give default names. Then make the
*  fields used in the program synonymous with these.
*
LSORT       VFLD = IDENT(VFLD) 'VALUE'
            NFLD = IDENT(NFLD) 'NEXT'
            PRED = IDENT(PRED) 'LGT'
            OPSYN('VFLD',VFLD)
            OPSYN('NFLD',NFLD)
            OPSYN('PRED',PRED)
*
*  Calling LSORTA with an argument of 0 will sort the entire
*  list.
*
            LSORT = LSORTA(0)                      :(RETURN)
*
*  Entry point for LSORTA: LSORTA(N) where N is a power of 2
*  will return a sorted list comprised of the first N links
*  of the list L (or all of the list if fewer than N links
*  remain). The variable L is treated as global and is
*  altered. If N is 0 the entire list will be sorted and
*  returned.
*
LSORTA      IDENT(L)                               :S(FRETURN)
*
*  Remove exactly one link from the head of the list. If N =
*  1, then we return immediately.
*
            LSORTA = L
            L = NFLD(L)
            NFLD(LSORTA) =
            I = 1
LSORT_1     EQ(N,I)                                :S(RETURN)
*
*  Otherwise our list is not sufficiently long. Let us obtain
*  another list of length I and merge the two. If L is null,
*  we are done.
*
            L2 = LSORTA(I)                         :F(RETURN)
            L1 = LSORTA
*
*  Merging begins here. PTR will point to the receptacle
*  which will receive the next item. Flow goes to LSORT_L1
*  if the next item is to come from list L1; otherwise, flow
*  falls through.
*
            PTR = .LSORTA
LSORT_C     PRED(VFLD(L1),VFLD(L2))                :F(LSORT_L1)
*
*  Choose L2; update PTR and L2; loop unless runout in
*  which case the entire L1 list is appended.
*
            $PTR = L2
            PTR = .NFLD(L2)
            L2 = NFLD(L2)
            IDENT(L2)                              :F(LSORT_C)
            $PTR = L1                              :(LSORT_DONE)
*
*  Choose L1; similar comments as above apply.
*
LSORT_L1    $PTR = L1
            PTR = .NFLD(L1)
            L1 = NFLD(L1)
            IDENT(L1)                              :F(LSORT_C)
            $PTR = L2
*
*  Our list (beginning at LSORTA) is now twice as long as it
*  was. Record this in I and loop back to see if this
*  suffices.
*
LSORT_DONE  I = I * 2                              :(LSORT_1)
LSORT_END
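
The doubling strategy of LSORTA may be easier to follow in a
different notation. The following Python sketch is offered
only as an illustration of that strategy, under some
assumptions: the class Link and the helpers from_values and
to_values are hypothetical, a one-element list named rest
plays the part of the global L, and a dummy link stands in
for the name pointer PTR.

```python
class Link:
    """A list link with a value field and a next field."""
    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def lsort(head, pred=lambda x, y: x > y):
    """Sort a linked list by rearranging pointers; return the new head."""
    rest = [head]                        # the unsorted remainder of the list

    def merge(l1, l2):
        dummy = tail = Link(None)
        while l1 is not None and l2 is not None:
            if pred(l1.value, l2.value):         # l1 belongs after l2
                tail.next, l2 = l2, l2.next
            else:
                tail.next, l1 = l1, l1.next
            tail = tail.next
        tail.next = l1 if l1 is not None else l2  # append on runout
        return dummy.next

    def lsorta(n):
        """Detach and return the first n links, sorted (all links if n == 0)."""
        if rest[0] is None:
            return None                  # nothing left: the FRETURN case
        result = rest[0]                 # remove exactly one link
        rest[0] = result.next
        result.next = None
        i = 1
        while n != i:
            l2 = lsorta(i)               # another sorted list of up to i links
            if l2 is None:
                break
            result = merge(result, l2)
            i *= 2                       # the list is now twice as long
        return result

    return lsorta(0)


def from_values(values):
    head = None
    for v in reversed(values):
        head = Link(v, head)
    return head


def to_values(head):
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out


result = to_values(lsort(from_values(list("SNOBOL"))))
```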

Program 13.4  MSORT


The function MSORT is a sort based on the merging principle.
A call to MSORT requires only one argument, the array of
strings to be sorted. It assumes the array has a lower bound
of 1 and obtains the upper bound by a call to the prototype
function.


MSORT(A) will not sort the array A but will return an array of
integers (i.e. a permutation vector) which can then be applied
to the array A and any passive array by using AI (Prog. 4.6).
Thus if A is an array of names and if B is an array of
(associated) salaries then


          I = MSORT(A)
          A = AI(A,I)
          B = AI(B,I)


will sort A and B according to alphabetic order of A. MSORT
will sort numerical items if a second argument denoting the
comparison predicate is given. Thus


          I = MSORT(B,'GT')
          B = AI(B,I)
          A = AI(A,I)


will sort the two lists by salary (in increasing order). More
exactly, an element X in the array B which appears before an
element Y will be placed after this element if and only if the
predicate GT(X,Y) holds.
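
The same pattern of sorting indices and then applying them to
several arrays can be illustrated in Python. This is a sketch
under assumptions: msort_indices and apply_indices are
hypothetical stand-ins, Python's built-in sort replaces the
merging method of MSORT, apply_indices assumes that element K
of AI(A,I) is A<I<K>>, and the sample names and salaries are
invented.

```python
from functools import cmp_to_key


def msort_indices(a, after=lambda x, y: x > y):
    """Return a 1-based permutation vector that sorts a.

    after(x, y) says whether x must be placed after y.
    """
    def cmp(i, j):
        if after(a[i - 1], a[j - 1]):
            return 1
        if after(a[j - 1], a[i - 1]):
            return -1
        return 0
    return sorted(range(1, len(a) + 1), key=cmp_to_key(cmp))


def apply_indices(a, indices):
    """Element k of the result is a[indices[k]], subscripts counted from 1."""
    return [a[i - 1] for i in indices]


names = ["Jones", "Adams", "Smith"]
salaries = [300, 500, 400]

i = msort_indices(names)
by_name = apply_indices(names, i), apply_indices(salaries, i)

i = msort_indices(salaries)
by_salary = apply_indices(names, i), apply_indices(salaries, i)
```

Because the index vector is computed once and applied to both
arrays, the passive array follows the active one exactly.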


The coding of MSORT is based on the sorting algorithm designed 
for APL as described by Woodrum [1969]. He defines the notion 
of a chain of subscripts as follows. Let P be an array of
integers. Then, for any integer K we have the sequence of
integers (called a chain)


          K, P<K>, P<P<K>>, ...


We will assume the sequence terminates by the appearance of a
0 subscript which will cause failure in the reference. In the
cited paper, the sequence terminates by two consecutive equal
subscripts. Such a sequence of integers can represent a list
of elements of the array A as


          A<K>, A<P<K>>, A<P<P<K>>>, ...


Whereas it seems to be always necessary to allocate fresh 
storage in order to do a merge sort, the method of chaining 
permits us to merge without allocating any more storage than 
needed to contain the permutation vector. The behavior of 
MSORT is such as to form increasingly longer chains represen- 
ting sorted lists of elements of A. 
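
A chain is simple to follow mechanically. The fragment below
is an illustrative Python sketch (the function name chain and
the sample arrays are invented for the example); subscript 0
is left unused so that subscripts start at 1 as in the text.

```python
def chain(p, k):
    """Yield K, P<K>, P<P<K>>, ... until a 0 subscript ends the chain."""
    while k != 0:
        yield k
        k = p[k]


a = [None, "PEAR", "APPLE", "PLUM", "FIG"]   # a[1..4]; a[0] is unused
p = [None, 3, 4, 0, 1]                       # the chain 2, 4, 1, 3
in_order = [a[k] for k in chain(p, 2)]
```

Here the chain beginning at subscript 2 represents the
elements of a in ascending order without moving any of them.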


*  MSORT(A,OP) uses a merge sort to return an array of
*  indices which can then be used to sort the array A. OP is
*  the operation to be used to indicate ordering.
*
            DEFINE('MSORT(A,OP)U,P,I,K,SAVE,AI,AJ')
*
*  CHAIN is an auxiliary function called by MSORT to chain
*  the indices in the global array P<L>, ..., P<U>. It
*  returns the top of the chain. It calls itself recursively.
*
            DEFINE('CHAIN(L,U)I,J,MIDDLE,R')       :(MSORT_END)
*
*  CHAIN entry point: If the number of items to be sorted is
*  1, just return the index.
*
CHAIN       CHAIN = EQ(L,U) L                      :S(RETURN)
*
*  Otherwise split the array into 2 parts, and chain each
*  part separately.
*
            MIDDLE = (L + U) / 2
            I = CHAIN(L,MIDDLE)
            J = CHAIN(MIDDLE + 1,U)
*
*  Now merge the 2 chains. The value to be returned will be
*  either I or J depending upon which should come first. This
*  is determined by the function CHAINOP which must be
*  defined by the caller.
*
            CHAIN = I
            AI = A<I>
            AJ = A<J>
            CHAIN = CHAINOP(A<I>,A<J>) J
*
*  K will point to the last element in the chain being built.
*  Then branch to increment one or the other of the 2
*  indices.
*
            K = CHAIN
            EQ(K,I)                                :S(CHAIN_I1)F(CHAIN_J1)
*
*  Come here to make all subsequent comparisons.
*
CHAIN_COMP  CHAINOP(AI,AJ)                         :S(CHAIN_J)F(CHAIN_I)
*
*  The I-chain has won; place I on the chain and update the
*  last-element pointer.
*
CHAIN_I     P<K> = I
            K = I
*
*  Obtain next element from I chain and go back for a
*  comparison; if no more elements are left, fall through,
*  concatenate the remainder of the J chain and return.
*
CHAIN_I1    I = P<I>
            AI = A<I>                              :S(CHAIN_COMP)
            P<K> = J                               :(RETURN)
*
*  The following code is analogous to the code above; J and I
*  have been interchanged.
*
CHAIN_J     P<K> = J
            K = J
CHAIN_J1    J = P<J>
            AJ = A<J>                              :S(CHAIN_COMP)
            P<K> = I                               :(RETURN)
*
*  Entry point for MSORT: Obtain comparison expression. Then
*  allocate a permutation vector (P) and form a chain.
*
MSORT       OP = IDENT(OP) 'LGT'
            OPSYN('CHAINOP',OP)
            U = +PROTOTYPE(A)
            P = ARRAY(U)
            I = CHAIN(1,U)
*
*  Convert chain by replacing in P<I> the value K where
*  A<P<I>> is the Kth element of the sort.
*
MSORT_1     K = K + 1
            SAVE = P<I>                            :F(MSORT_2)
            P<I> = K
            I = SAVE                               :(MSORT_1)
*
*  We now have the inverse of a permutation vector. Invert
*  it and return it.
*
MSORT_2     IP(P)
            MSORT = P                              :(RETURN)
MSORT_END


Names referenced by MSORT:

    Name        Type        Where defined
    IP          Function    Program 12.6


Epilogue


Merge sorting is quite fast. It not merely betters the figure 
of n log2 n comparisons (but of course not less than log2 n!)
but will take advantage of any pre-ordering that exists in the 
data. Its popularity for sorting arrays has been inhibited by 
the necessity of allocating additional storage. 
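
Both claims can be observed by counting. The sketch below is
an illustrative Python rendering of merging by chains of
subscripts (it is not the SNOBOL4 program: the name msort, the
comparison counter, and the final walk along the chain in
place of the inversion step are choices made for the example).

```python
import random


def msort(a):
    """Return (permutation vector, comparisons) for the list a[1..u].

    a[0] is an unused placeholder. The vector lists the subscripts of a
    in sorted order; within p, a 0 marks the end of a chain.
    """
    u = len(a) - 1
    p = [0] * (u + 1)
    count = 0

    def chain(lo, hi):
        nonlocal count
        if lo == hi:
            return lo
        middle = (lo + hi) // 2
        i = chain(lo, middle)
        j = chain(middle + 1, hi)
        top = k = 0                      # k is the last element chained so far
        while i != 0 and j != 0:
            count += 1
            if a[i] > a[j]:              # the J chain wins
                nxt, j = j, p[j]
            else:                        # the I chain wins
                nxt, i = i, p[i]
            if k == 0:
                top = nxt
            else:
                p[k] = nxt
            k = nxt
        p[k] = i if i != 0 else j        # concatenate the remainder
        return top

    vector = []
    k = chain(1, u) if u > 0 else 0
    while k != 0:
        vector.append(k)
        k = p[k]
    return vector, count


random.seed(4)
data = [None] + [random.random() for _ in range(1024)]
vector, count = msort(data)
presorted_vector, presorted_count = msort([None] + list(range(1024)))
```

With n = 1024, n log2 n is 10240. The random input needs fewer
comparisons than that, and the presorted input needs only 5120.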


Program 13.5  FRSORT


A frequency sort on a string will return a string where the
characters have been sorted on the basis of the number of
occurrences in the string. Each character will appear at most
once in the returned string. For example,


          FRSORT('MISSISSIPPI') will return 'ISPM'.


This is an example of a sorting application which makes use of
a passive array of information (the characters) while sorting
on an array of numbers. It also serves to demonstrate the use
of MSORT.
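
For comparison, the same computation can be sketched in a few
lines of Python. This is an illustration only; the name frsort
is reused for convenience, and the sketch assumes that
characters of equal frequency keep their order of first
appearance, which is consistent with the example above.

```python
def frsort(s):
    """Return the distinct characters of s, most frequent first."""
    characters = list(dict.fromkeys(s))          # distinct, in order of appearance
    counts = [s.count(c) for c in characters]    # the active array of numbers
    # Sort the indices on the counts, then apply them to the characters.
    order = sorted(range(len(characters)), key=lambda i: -counts[i])
    return "".join(characters[i] for i in order)


result = frsort("MISSISSIPPI")
```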


*  FRSORT(S) will do a frequency sort on the characters of
*  the string S. The most frequent character will appear
*  first in the string returned.
*
            DEFINE('FRSORT(S)SC,C,N,I')            :(FRSORT_END)
*
*  Entry point: Obtain in the array C the set of characters
*  of which S is composed. Then allocate an array N to hold
*  the number of occurrences in S of the corresponding
*  characters of C.
*
FRSORT      C = CRACK(SKIM(S))
            N = ARRAY(PROTOTYPE(C))
            SEQ(' N<I> = COUNT(S,C<I>) ',.I)
*
*  Sort the indices of N and apply these indices to the array
*  C. Then convert the array to a string.
*
            FRSORT = STRINGOUT(AI(C,MSORT(N,'LT')))  :(RETURN)
FRSORT_END


Names referenced by FRSORT:

    Name        Type        Where defined
    SKIM        Function    Program 3.11
    COUNT       Function    Program 3.4
    AI          Function    Program 4.6
    MSORT       Function    Program 13.4
    STRINGOUT   Function    Program 4.2
    CRACK       Function    Program
    SEQ         Function    Program 4.3

SELECTION SORTING


In selection sorting the least element of the input aggregate
is selected and is placed into the output aggregate. This
element can be chosen in the straightforward way of making one
pass through the array to determine the least element. When an
element is chosen, its position can be filled with a special
marker to avoid selecting that element in the future. To
select the least element in this way requires n-1 comparisons
and hence this form of selection sort requires a total of
n(n-1) comparisons. This is unfortunately far more than the
theoretical minimum of n log2 n.
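
A sketch of this straightforward selection, written in Python
for illustration (the names USED, selection_sort and less are
hypothetical), shows where the n(n-1) figure comes from: every
pass examines all n positions, the marked ones included.

```python
USED = object()                      # special marker for selected positions


def selection_sort(items):
    """Return (sorted list, comparisons) using straightforward selection."""
    work = list(items)
    output = []
    comparisons = 0

    def less(x, y):
        """True if x precedes y; the marker follows every real item."""
        nonlocal comparisons
        comparisons += 1
        if x is USED:
            return False
        return y is USED or x < y

    for _ in range(len(work)):
        least = 0
        for position in range(1, len(work)):    # n - 1 comparisons per pass
            if less(work[position], work[least]):
                least = position
        output.append(work[least])
        work[least] = USED           # never select this position again
    return output, comparisons


result, count = selection_sort([3, 1, 2])
```

For the three items above the count is 3 * 2 = 6 comparisons.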


But selection sorting can be continually refined until this 
lower limit is approached. For example, the n items can be 
subdivided into SQRT(n) groups of SQRT(n) items each. Assume 
that for each group a least item is known. Then a selection 
consists of first selecting the least of these least items. 
Then only the selected candidate's group must be searched for 
a least item to recompose the original situation. This kind 
of selection will be called order-2 selection and requires 


$$ 2\,(n^{1/2} - 1) $$


comparisons for each item obtained. We may decompose our array 
into a group of groups of groups and so have order-3 selec- 
tion. Assuming each group has the same number of members (the 
cube root of n) then a selection would require 


    $$3(n^{1/3} - 1)$$


comparisons. For a level k hierarchy we would need 


    $$k(n^{1/k} - 1)$$


comparisons per item. This value monotonically decreases as k
increases and so it pays to make k as large as possible. In 
the limit the hierarchy becomes a binary tree. The 'winner' 
of each subgroup 'plays' the 'winner' of the adjacent subgroup 
to determine the winner of the group, etc. This method of 
sorting has the suggestive name tournament sort. The number 
of levels k becomes $\log_2 n$ and plugging this value in for
k we obtain


    $$\log_2 n \, (2 - 1) = \log_2 n$$


comparisons per extraction which is close to the theoretical 
limit. 
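
The trend in these formulas is easy to check numerically. The
following Python sketch is for illustration only; the function
name and the sample size n = 4096 are assumptions and are not
part of the programs in this book. It evaluates the cost per
item for every hierarchy depth from plain selection (k = 1) to
the binary tournament (k = log2 n).

```python
import math

def comparisons_per_item(n, k):
    """Comparisons to extract one item from an order-k hierarchy
    whose groups each hold n**(1/k) members."""
    return k * (n ** (1.0 / k) - 1)

n = 4096                       # illustrative size, a power of 2
levels = int(math.log2(n))     # deepest hierarchy: a binary tree
costs = [comparisons_per_item(n, k) for k in range(1, levels + 1)]

for k, cost in enumerate(costs, start=1):
    print(k, round(cost, 2))
```

For this n the cost falls from 4095 comparisons per item at
k = 1 to 12 (that is, log2 n) at k = 12.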


Program 13.6 - TSORT

TSORT stands for Tournament sort; it also stands for Table
sort since it can be used to sort tables as well as one- and
two-dimensional arrays. The method by which tournament winners
are recorded is by an auxiliary array of subscripts. Consider
a typical tournament where the winner is decided by lexical
ordering (first in alphabetical order wins). The playoff of
such a tournament is shown in Figure 13.2.


Figure 13.2


Here, subscripts, rather than actual values, are used to 
denote players in the tournament. Assume that the number of 
players N in the tournament is a power of 2. Then the tourna- 
ment can be recorded in an array T of length 2 * N - 1. For 
example the above tournament is represented as: 


Index      1  2  3  4  5  6  7     8  9 10 11 12 13 14 15

Array T    2  2  6  2  3  6  8     1  2  3  4  5  6  7  8

           Playoff results         Base of tournament


Here the elements T<8> through T<15> (in general, T<N> through
T<2 * N - 1>) hold the base of the tournament. The rest of
array T is filled in as follows. To determine which subscript
(of array A) should be placed into T<I>, a playoff is arranged
between T<I * 2> and T<I * 2 + 1>. This method of recording
the tournament is adopted from a tree-sorting algorithm by
Floyd [1964], and can generally be used to encode a balanced
binary tree. T<I> has sons T<I * 2> and T<I * 2 + 1> and has
father T<I / 2>.


The value found in T<1> is the subscript in A of the overall
tournament winner. To find the runner-up, the winner is
'disqualified' by assigning a zero subscript into his original
slot. This is found by adding N - 1 to the subscript in A.
Thus if A<2> is the winner, T<2 + N - 1> is set to 0 to
produce:


Index      1  2  3  4  5  6  7     8  9 10 11 12 13 14 15

Array T    2  2  6  2  3  6  8     1  0  3  4  5  6  7  8


A series of events is then run to resolve the outcome of games
in which only he was involved. This is done as follows. The
element T<9> was used in the battle to determine T<9 / 2> =
T<4>. Hence we recompare T<2 * 4> and T<2 * 4 + 1>. The
resulting element T<4> is used to compute the new entry in
T<4 / 2> = T<2>. This proceeds for $\log_2 N$ steps until T<1>
is determined. In our example, this produces:


Index      1  2  3  4  5  6  7     8  9 10 11 12 13 14 15

Array T    6  3  6  1  3  6  8     1  0  3  4  5  6  7  8


The new winner, indicated by T<1>, is 6 which refers to 'BILL'
in the original array A. This process is repeated until the
winning index is a zero.
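
The bookkeeping just described can be sketched in a few lines
of Python. This is an illustration only and not a translation
of TSORT: the function names are assumptions, the predicate is
fixed at lexical order, and the playoff array is always given
2 * size entries. The names in the sample array are also
hypothetical (only 'BILL' in position 6 comes from the text);
they were chosen so that the three states of array T shown
above are reproduced.

```python
def playoff(a, t, k):
    """Record in t[k] the winner of the match between t[2k] and t[2k+1].
    Entries are 1-based subscripts into a; 0 marks an open slot."""
    i, j = t[2 * k], t[2 * k + 1]
    if i == 0:
        t[k] = j
    elif j == 0:
        t[k] = i
    else:
        # The lexically smaller string wins; on a tie the left player wins.
        t[k] = j if a[i - 1] > a[j - 1] else i

def build_tournament(a):
    """Return (t, size): the playoff array and the power-of-2 base size."""
    size = 1
    while size < len(a):
        size *= 2
    t = [0] * (2 * size)            # t[0] is unused, as in a 1-based array
    for i in range(1, len(a) + 1):
        t[size - 1 + i] = i         # base of the tournament
    for k in range(size - 1, 0, -1):
        playoff(a, t, k)
    return t, size

def disqualify(a, t, size, w):
    """Vacate the winner's base slot and replay the matches on his path."""
    k = size - 1 + w
    t[k] = 0
    k //= 2
    while k >= 1:
        playoff(a, t, k)
        k //= 2

def tournament_sort(a):
    """Return a new sorted list; a itself is not modified."""
    t, size = build_tournament(a)
    out = []
    while t[1] != 0:
        w = t[1]
        out.append(a[w - 1])
        disqualify(a, t, size, w)
    return out

names = ["JOE", "AL", "DON", "SAM", "TOM", "BILL", "ZEKE", "PAT"]
t, size = build_tournament(names)
print(t[1:])                 # [2, 2, 6, 2, 3, 6, 8, 1, 2, 3, 4, 5, 6, 7, 8]
disqualify(names, t, size, t[1])
print(t[1:])                 # [6, 3, 6, 1, 3, 6, 8, 1, 0, 3, 4, 5, 6, 7, 8]
print(tournament_sort(names))
```

Because ties are awarded to the left-hand player, equal
elements come out in their original relative order.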


TSORT(A,F,P) will use a tournament sort to sort the elements
of the array or table A according to predicate P. P may be
absent in which case the assumed predicate is LGT.
A may be singly-dimensioned in which case F, if nonnull, 
will indicate the field of a programmer-defined datatype 
on which the sort is based. A may also be a table or a 
doubly dimensioned array. In these cases, F may be an in- 
teger indicating the column on which to sort. If F is 
null, it is taken to be 1. The array A is not modified; a 
new array is allocated and returned. 

          DEFINE('TSORT(A,F,P)I,J,X,N,TS,T,P_I_J,K,II,W')

+------------------------------------------------------------+
| PLAYOFF(K) is a utility routine used by TSORT to determine |
| the winner of T<K * 2> and T<K * 2 + 1> and to modify T<K> |
| accordingly. It will fail if K is < 1. The array T         |
| contains subscripts; some of these are 0 indicating open   |
| slots.                                                     |
+------------------------------------------------------------+
          DEFINE('PLAYOFF(K)')                    :(PLAYOFF_END)

PLAYOFF   LT(K,1)                                 :S(FRETURN)
          I  =  T<K * 2>                          :F(PLF_J)
          J  =  T<K * 2 + 1>                      :F(PLF_I)
          LE(I,0)                                 :S(PLF_J)
          LE(J,0)                                 :S(PLF_I)
          EVAL(P_I_J)                             :S(PLF_J)
PLF_I     T<K>  =  I                              :(RETURN)
PLF_J     T<K>  =  J                              :(RETURN)
PLAYOFF_END


+------------------------------------------------------------+
| TS will compute a tournament size needed for N elements;   |
| i.e. the smallest power of 2 >= N.                         |
+------------------------------------------------------------+
          DEFINE('TS(N)')                         :(TS_END)

TS        TS  =  1
TS_1      TS  =  LT(TS,N)  TS * 2                 :S(TS_1)F(RETURN)
TS_END                                            :(TSORT_END)


+------------------------------------------------------------+
| TSORT entry point: Compute the size of the tournament      |
| (TS). Allocate the tournament array (T) and the array to   |
| be returned.                                               |
+------------------------------------------------------------+
TSORT     A  =  CONVERT(A,'ARRAY')
          TSORT  =  ARRAY(PROTOTYPE(A))
          N  =  PROTOTYPE(A)
          N  BREAK(',') . N                       :F(TSORT_1)
          F  =  IDENT(F)  1
TSORT_1   TS  =  TS(N)
          T  =  ARRAY(TS - 1 + N)
+------------------------------------------------------------+
| Initialize base of the tournament.                         |
+------------------------------------------------------------+
TSORT_2   I  =  I + 1
          T<TS - 1 + I>  =  I                     :S(TSORT_2)
+------------------------------------------------------------+
| Obtain comparison expression.                              |
+------------------------------------------------------------+
          P  =  IDENT(P)  'LGT'
          X  =  F '(A<I>),' F '(A<J>)'
          X  =  IDENT(DATATYPE(F),'INTEGER')
+               'A<I,' F '>,A<J,' F '>'
          P_I_J  =  CONVERT(P '(' X ')', 'EXPRESSION')


+------------------------------------------------------------+
| Now run a complete tournament determining an absolute      |
| winner (in T<1>).                                          |
+------------------------------------------------------------+
          K  =  TS
TSORT_3   K  =  K - 1
          PLAYOFF(K)                              :S(TSORT_3)
+------------------------------------------------------------+
| Transfer the winning structure to TSORT. For a             |
| one-dimensional array, this is simple. For a               |
| two-dimensional array, we must go through a loop.          |
+------------------------------------------------------------+
TSORT_4   II  =  II + 1
          W  =  T<1>
          EQ(W,0)                                 :S(RETURN)
          TSORT<II DIFFER(DATATYPE(F),'INTEGER')>  =  A<W>
+                                                 :S(TSORT_7)
          J  =  0
TSORT_6   J  =  J + 1
          TSORT<II,J>  =  A<W,J>                  :S(TSORT_6)
+------------------------------------------------------------+
| 'Disqualify' the winner. Replay all matches in which he    |
| was involved.                                              |
+------------------------------------------------------------+
TSORT_7   K  =  TS - 1 + W
          T<K>  =  0
TSORT_5   K  =  K / 2
          PLAYOFF(K)                              :S(TSORT_5)F(TSORT_4)
TSORT_END

Epilogue 


The tournament sort as given uses a near minimum number of 
comparisons but unfortunately allocates two additional arrays. 
For sorting structures, strings or two-dimensional arrays, the
additional allocation is probably not harmful since it will be 
small compared to the storage already allocated. Minimum core 
sorting of arrays such as HSORT (Prog. 13.2) and Treesort 3 
[Floyd 1964] have the unfortunate property of inverting equal
elements and this, we will see, can be bad for sorting arrays 
of structures. Other minimum storage sorting algorithms such 
as BSORT (Prog. 13.1) and one by Shell [1959] have the 
property of not being minimum time. There appears to be, at 
this writing, no minimum-core sorting algorithm (i.e. an in- 
place sort) which is minimum time and inversion free. 


INSERTION SORTING

In an insertion sort the next available element to be sorted
is placed in the correct relative position in the output
aggregate. This requires that the number of elements in the
output aggregate be adjustable and suggests the use of a list,
a string or a tree. A simple-minded insertion sort will
compare the next item on the input list with each item in
sequence on the output list until the correct place is found
at which point an insertion is made. This would require, on
the average, n/4 comparisons for each inserted item. This is
too many for large n. But for small n, where time is not an
issue, this simple scheme has the advantage of providing a
very simple sort.


Program 13.7 - SSORT

SSORT(SS,S) is a string sort (or short sort or simple sort).
The string S is inserted into a string of strings (separated
by commas) in SS. The augmented list is returned as value.
For example, if the items in the input stream are being read
in and are to be sorted one may execute


LOOP      LIST  =  SSORT(LIST, TRIM(INPUT))       :S(LOOP)


If the input contained the names 'PAT', 'JOE', 'TOM' then the 
resulting LIST would contain ',JOE,PAT,TOM,'. Note that 
leading and trailing commas form part of the resulting string. 


          DEFINE('SSORT(SSORT,S)T')
          SS_PAT  =  ','  (BREAK(',') $ T  *LGT(T,S) | RPOS(0)) . T
+                                                 :(SSORT_END)

SSORT     SSORT  SS_PAT  =  ',' S ',' T           :S(RETURN)
          SSORT  =  ',' S ','                     :(RETURN)
SSORT_END
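
For readers who want to trace the effect of the pattern match,
the following Python sketch performs the same insertion into a
comma-delimited string. It is an illustration under stated
assumptions, not a rendering of the SNOBOL4 statement: the
function name is borrowed from the program above, and the
list is split into items rather than scanned by a pattern.

```python
def ssort(ss, s):
    """Insert s into the sorted, comma-delimited list ss (',A,B,')."""
    if not ss:
        return "," + s + ","          # first item: create the list
    items = ss[1:-1].split(",")       # drop leading and trailing commas
    for pos, item in enumerate(items):
        if item > s:                  # first item lexically greater than s
            items.insert(pos, s)
            break
    else:
        items.append(s)               # no greater item: s goes at the end
    return "," + ",".join(items) + ","

lst = ""
for name in ["PAT", "JOE", "TOM"]:
    lst = ssort(lst, name)
print(lst)                            # ,JOE,PAT,TOM,
```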

Epilogue 


SSORT was written to be as short and as convenient as
possible. Its major failing is that it is slow. Not only is
it a quadratic sort, but the data structure holding the sorted
items is not the most conducive to high speed insertion. On
the other hand, many if not most sort applications require
only something 'quick and dirty' and for such applications
SSORT is recommended since it is not only easy to type but it 
saves on program space. 


Program 13.8 - INSERT

The insertion sort, like the other sorts, can be refined to
the point where it becomes a logarithmic sort. To find the
correct position of the ith element we ought to compare it
with the middle item. If it is greater than this middle item
it is compared with the middle item in the upper half, and so
forth. Thus, to insert the ith item requires approximately
$\log_2 i$ comparisons. The total number becomes
(approximately)


    $$\log_2 1 + \log_2 2 + \cdots + \log_2 n = \log_2 n!$$


which is the theoretical lower limit.


This sounds attractive, but how does one find the middle
element in each of these lists? The middle element of an array
(or subsection of an array) can be easily computed but an
array is not adjustable and its use would prove awkward in an
insertion sort. That is, although the sort would prove
logarithmic with respect to compares it would be quadratic
with respect to moves. A list, on the other hand, is
adjustable and an element can easily be inserted within it,
but the central element is not easily found. The solution is
to use a tree as the receiving data aggregate.


For example, assume that the following strings are to be 
inserted. 


    NOW IS THE TIME FOR ALL GOOD MEN


If these strings are inserted into a binary tree, the result
is depicted in Figure 13.3.


                      NOW
                    /     \
                 IS         THE
               /    \          \
            FOR      MEN        TIME
           /   \
        ALL     GOOD


                  Figure 13.3


The first string is associated with the root node. The second 
string is lexicographically less than the first and so is
associated with the left branch of the binary tree. Each
additional string is compared with the node and successive
descendants until an opening in the tree is found at which
point the string is inserted. A trace through the tree will 
readily indicate the nature of this process. 


+------------------------------------------------------------+
| INSERT(T,S) will insert the string S into the tree T and   |
| return the modified tree. If T is null a root node is      |
| created and returned.                                      |
+------------------------------------------------------------+
          DEFINE('INSERT(T,S)V')
+------------------------------------------------------------+
| BTNODE is the datatype of a single node of a binary tree.  |
+------------------------------------------------------------+
          DATA('BTNODE(VALUE,NO,LSON,RSON)')      :(INSERT_END)
+------------------------------------------------------------+
| Entry point: If T is null, return immediately with a fresh |
| node. Else we prepare to return T and go on to modify it.  |
| Get VALUE(T) out for fast and easy reference. If S equals  |
| value, increment count by 1 and return.                    |
+------------------------------------------------------------+
INSERT    INSERT  =  IDENT(T)  BTNODE(S,1)        :S(RETURN)
          INSERT  =  T
          V  =  VALUE(T)
          NO(T)  =  IDENT(S,V)  NO(T) + 1         :S(RETURN)
+------------------------------------------------------------+
| If S > value, insert S into right half of tree; otherwise  |
| into left half.                                            |
+------------------------------------------------------------+
          RSON(T)  =  LGT(S,V)  INSERT(RSON(T),S) :S(RETURN)
          LSON(T)  =  INSERT(LSON(T),S)           :(RETURN)
INSERT_END

Epilogue 


Note that we do not create separate nodes for duplicate items 
but record a count in a field of the node. This saves on 
storage if the percentage of duplicate items is 20% or so. It 
also saves on compute time, especially if there are many
duplicate items. For this reason, the binary insertion sort 
is ideal for preparing a word concordance which is a word- 
frequency analysis of a piece of text. 


Program 13.9 - LINEARIZE

LINEARIZE(T) will linearize a binary tree of the kind used in
INSERT (Program 13.8). The tree will be strung via its right
sons. The value returned will be the first node of the tree.
If T is null, LINEARIZE will fail.


          DEFINE('LINEARIZE(T)')                  :(LINEARIZE_END)
+------------------------------------------------------------+
| Entry point:                                               |
+------------------------------------------------------------+
LINEARIZE IDENT(T)                                :S(FRETURN)
+------------------------------------------------------------+
| Linearize the left side and attach on node T (LAST_NAME is |
| a global variable set to equal the name of the last link   |
| on the chain).                                             |
+------------------------------------------------------------+
          LINEARIZE  =  IDENT(LSON(T))  T         :S(LIN_1)
          LINEARIZE  =  LINEARIZE(LSON(T))
          $LAST_NAME  =  T
+------------------------------------------------------------+
| Now linearize the right-hand side.                         |
+------------------------------------------------------------+
LIN_1     RSON(T)  =  LINEARIZE(RSON(T))          :S(RETURN)
          LAST_NAME  =  .RSON(T)                  :(RETURN)
LINEARIZE_END
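
A Python sketch of the same threading may help in following
the role of LAST_NAME. It rests on assumptions that differ
from the SNOBOL4 version: each call returns the first and last
nodes of its chain instead of using a global name, and a null
tree yields (None, None) rather than failure. The node class
and insert function are repeated so that the sketch stands
alone.

```python
class BTNode:
    def __init__(self, value):
        self.value, self.no = value, 1
        self.lson = self.rson = None

def insert(t, s):
    if t is None:
        return BTNode(s)
    if s == t.value:
        t.no += 1
    elif s > t.value:
        t.rson = insert(t.rson, s)
    else:
        t.lson = insert(t.lson, s)
    return t

def linearize(t):
    """String tree t through its rson links; return (first, last)."""
    if t is None:
        return None, None
    first, left_last = linearize(t.lson)
    if first is None:
        first = t                  # nothing to the left: t heads the chain
    else:
        left_last.rson = t         # attach t after the left-hand chain
    right_first, right_last = linearize(t.rson)
    t.rson = right_first
    last = right_last if right_last is not None else t
    return first, last

tree = None
for word in "NOW IS THE TIME FOR ALL GOOD MEN".split():
    tree = insert(tree, word)

node, _ = linearize(tree)
chain = []
while node is not None:            # follow the right sons only
    chain.append(node.value)
    node = node.rson
print(chain)
```

As in the original, the lson fields are left untouched.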


Program 13.10 - INSERTB

With some sorting procedures, an almost-sorted input will
serve to decrease sorting time. The speedup is most pronounced
with the bubble sort but pre-ordering will favorably affect
the merge and Hoare sort as well. With the tree insertion sort
we have the reverse phenomenon. If the elements inserted are
already in alphabetic order the number of comparisons to
insert the Ith element is I-1, the worst case. The logarithmic
sort becomes a quadratic sort. Perversely, if the elements are
initially in reverse alphabetic order, we also achieve the
worst case of I-1 comparisons for the Ith element.


But the insertion sort can be modified slightly to not only
avoid the inefficiencies of almost-ordered data but to
actually take advantage of any ordering that exists. The trick
is to grow the tree backward; that is, the last node to be
inserted should become the root of the tree.


For example, if the sequence of strings is


    NOW IS THE TIME FOR ALL GOOD


the tree grown backward becomes as shown in Figure 13.4. A
rough rule for growing the tree backward is the following.
Draw an imaginary line down the middle of the tree separating
all nodes less than the new root from all nodes greater than
it. Any path broken by such a line should be 'short circuited'
so that all pointers from any node are directed to nodes in
the same half of the tree. As an example, the result of adding
the string 'MEN' to the diagram in Figure 13.4 is shown in
Figure 13.5.


                 GOOD
                /    \
             ALL      TIME
                \     /
               FOR  THE
                    /
                  IS
                    \
                    NOW


                  Figure 13.4


                   MEN
                 /     \
             GOOD       TIME
            /    \      /
         ALL      IS  THE
            \         /
            FOR     NOW


                  Figure 13.5


+------------------------------------------------------------+
| INSERTB(T,S) will insert the string S into the backward-   |
| growing binary tree T. The root of the returned tree will  |
| contain S.                                                 |
+------------------------------------------------------------+
          DEFINE('INSERTB(T,S)V')
          DATA('BTNODE(VALUE,NO,LSON,RSON)')      :(INSERTB_END)
+------------------------------------------------------------+
| Entry point: The first part is similar to INSERT.          |
| Comments there are appropriate here.                       |
+------------------------------------------------------------+
INSERTB   INSERTB  =  IDENT(T)  BTNODE(S,1)       :S(RETURN)
          INSERTB  =  T
          V  =  VALUE(T)
          NO(T)  =  IDENT(S,V)  NO(T) + 1         :S(RETURN)
+------------------------------------------------------------+
| If S > value, insert S into the right half of the tree.    |
| The root node of the returned tree will have a VALUE of S  |
| and will become the root node of the tree we will be       |
| returning.                                                 |
+------------------------------------------------------------+
          LGT(S,V)                                :F(INSERTB_L)
          INSERTB  =  INSERTB(RSON(T),S)
+------------------------------------------------------------+
| Include the rest of T under the left side of this new      |
| root.                                                      |
+------------------------------------------------------------+
          RSON(T)  =  LSON(INSERTB)
          LSON(INSERTB)  =  T                     :(RETURN)
+------------------------------------------------------------+
| Do an analogous thing for the opposite side.               |
+------------------------------------------------------------+
INSERTB_L INSERTB  =  INSERTB(LSON(T),S)
          LSON(T)  =  RSON(INSERTB)
          RSON(INSERTB)  =  T                     :(RETURN)
INSERTB_END

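
The backward-growing insertion can be checked against Figures
13.4 and 13.5 with a short Python sketch. It is offered as an
illustration under assumptions: the names mirror the SNOBOL4
ones, None stands for the null tree, a duplicate is assumed
to return the node that already holds the value, and the
helper shape exists only to display the result.

```python
class BTNode:
    def __init__(self, value):
        self.value, self.no = value, 1
        self.lson = self.rson = None

def insertb(t, s):
    """Insert s so that the node holding s becomes the root."""
    if t is None:
        return BTNode(s)
    if s == t.value:
        t.no += 1
        return t
    if s > t.value:
        root = insertb(t.rson, s)   # new root comes up from the right
        t.rson = root.lson          # nodes less than s stay with t
        root.lson = t
    else:
        root = insertb(t.lson, s)   # mirror image for the left side
        t.lson = root.rson
        root.rson = t
    return root

def shape(t):
    """Nested (value, left, right) tuples, for inspection only."""
    if t is None:
        return None
    return (t.value, shape(t.lson), shape(t.rson))

tree = None
for word in "NOW IS THE TIME FOR ALL GOOD".split():
    tree = insertb(tree, word)
figure_13_4 = shape(tree)

tree = insertb(tree, "MEN")
figure_13_5 = shape(tree)
print(figure_13_4)
print(figure_13_5)
```

When the input is already in order, each new string is
compared only with the current root, so ordered data costs one
comparison per insertion instead of I-1.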
DISTRIBUTIVE SORTS

So far, every sort we've presented was a comparative sort.
There are other kinds, however, and these we can all lump
together in a category called distributive. In a distributive
sort, each item to be sorted is placed in a position with
respect to the other items according to some parameter of that
item. This has the attractive feature of not being binary and
thereby one can better the $n \log_2 n$ limitation. For
example, if one is sorting real numbers, uniformly distributed
between 0 and 1, an excellent technique is to begin
distributing the items one at a time into the receiving array
in approximately their final position depending only on their
value. Unless one is lucky, collisions will begin to occur as
the receiving array is filling up, but the time to patch up
such discrepancies is assumed to be small compared with the
time saved by the almost-one-pass nature of the sort. The
effectiveness of such a sort is highly data dependent,
however, and for this reason is not very popular.


A more familiar distributive sort is the radix sort. This is 
the sort used on mechanical sorters which distribute cards in- 
to bins. Assuming n cards are to be sorted on a field con- 
taining k characters, a distribution over the least 
significant character is made first. The clumps are gathered 
together and passed through the machine again, this time on 
the next least significant character. After k passes, the
entire deck is sorted. The number of operations is nk rather
than $n \log_2 n$ because each operation involves pitching a card
into one of several bins and such an operation yields more in- 
formation than a binary choice. 
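
The card-sorter procedure is short enough to sketch in Python.
This is an illustration only, not the SNOBOL rendition cited
below; the function name and sample data are assumptions, and
every key is assumed to be exactly k characters long.

```python
def radix_sort(cards, k):
    """Sort strings of exactly k characters with k distribution passes."""
    for pos in range(k - 1, -1, -1):      # least significant character first
        bins = {}
        for card in cards:                # pitch each card into its bin
            bins.setdefault(card[pos], []).append(card)
        # Gather the clumps in bin order, keeping the order within a bin.
        cards = [card for key in sorted(bins) for card in bins[key]]
    return cards

deck = ["TOM", "PAT", "JOE", "AL ", "BOB", "PAM"]
print(radix_sort(deck, 3))
```

Each pass must preserve the order established by the earlier
passes, which is why the cards within a bin are gathered in
the order in which they arrived.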


We do not have space to describe a SNOBOL4 rendition of the 
radix sort but happily refer the reader to the original SNOBOL 
article [Farber et al, 1964] where it appeared as an example.



Exercise 13.1

What two instructions constitute the inner loop of BSORT? Can
the reader recommend a slightly faster version?


Exercise 13.2

Prove that in HSORT the value of K when the recursive call
HSORT(A,I,K) is made is always less than N thereby removing
the possibility of an infinite loop.


Exercise 13.3

Write a non-recursive version of HSORT using PUSH and POP
(Programs 5.5 and 5.6). Hint: This can be done by modifying 2
go-to fields and adding 5 very simple instructions in place of
the 2 recursive calls.


Exercise 13.4

Given 3 items to sort, what is the average number of
comparisons required by BSORT and by HSORT? Note, as a
consequence, that BSORT will actually be faster than HSORT for
small arrays. Estimate the crossover point at which the number
of comparisons is the same. Then modify HSORT so that it calls
BSORT for arrays smaller than this. (The estimate may be made
on analytical or empirical grounds.)

Exercise 13.5

The elements of an array A are to be sorted numerically in
ascending sequence but all numbers within a certain range R of
each other are to be regarded as numerically equal and are to
retain their relative ordering. Using MSORT, define an
appropriate predicate and sort A accordingly.


Exercise 13.6

Assume we wish to sort an array of strings, A, alphabetically
as defined by the predicate AGT (Prog. 3.13). We could call
MSORT(A, 'AGT'). What is a more efficient procedure?


Exercise 13.7

Both MSORT(A, 'LT') and MSORT(A, 'LE') can be used to sort A
in decreasing numerical order. The difference between the two
is in the way equal elements are treated. Which should be used
so that the relative order of equal items is retained?


Exercise 13.8

SSORT can be speeded up considerably by the following
technique. Represent a binary tree as a string by the
following method. The null string is the null tree. A tree
with root R is represented as:


    (LSON) R (RSON)


where LSON is the string representation of the left son of the
tree and RSON is the representation of the right son. Then
BAL can be used to rapidly scan for an insertion point. A tree
is built up much in the manner of INSERT. Rewrite SSORT so
that the string returned is this tree.


Exercise 13.9

The body of SSORT (Prog. 13.7) need only be one statement.
Modify the pattern SS_PAT so that the :S(RETURN) can be
changed to :(RETURN) and the second statement deleted
entirely.


Exercise 13.10

One can enhance the speed of INSERT by periodically balancing
the tree. Write a function TREEBAL(N) which will balance a
tree beginning at node N and return the root of the balanced
tree. The use of LINEARIZE to write this function is optional.


Exercise 13.11

Modify LINEARIZE so that the LSON fields are cleared.


Exercise 13.12

Modify LINEARIZE so that it counts the number of nodes in the
tree. Assume some global variable exists (say N) which is
initially 0.


Exercise 13.13

The average number of comparisons of a logarithmic insertion
sort was estimated in the text to be $\log_2 n!$. This average
would be achievable by INSERT only if the tree is always kept
perfectly balanced. But for random data this will not be the
case and the expected degree of unbalance can be computed.


a) Determine the average number of comparisons required by
the tree-insertion sort. Assume that every input permutation
is equally likely and that no two items are identical.


b) As n approaches infinity, what is the ratio between this
number and $n \log_2 n$?


Exercise 13.14

What does the tree resemble when the following strings are
placed into a) INSERT and b) INSERTB?


    A QUICK BROWN FOX JUMPED OVER THE LAZY DOG



The function definition facility in SNOBOL4 is somewhat
unorthodox. In conventional languages, a function (or its
equivalent) is defined at compile time. Thus, its entry point,
number and type of arguments, temporaries, etc. are fixed for
the duration of the program. In SNOBOL4, these are governed by
arguments to the DEFINE function. Since these arguments can be
the product of an arbitrary computation, and since the DEFINE
function can be called at any time, the function-defining
facility is extraordinarily flexible. This section shows
several examples of how this flexibility can be harnessed to
produce more efficient, better structured and more powerful
programs.


Program 14.1 - DEXP

DEXP(proto) permits functions to be easily defined in terms of
simple, one-line expressions. For example:


          DEXP('AVE(X,Y) = (X + Y) / 2.0')


will define the function AVE(X,Y) to be equal to half the sum
of X and Y. It thus mimics the Fortran arithmetic function
facility. It is, however, much more powerful, since any
sequence of statements separated by semicolons may be used to
specify a function. In fact, arbitrary functions may be
defined in this way.


          DEFINE('DEXP(PROTO)NAME,ARGS')          :(DEXP_END)
+------------------------------------------------------------+
| Entry point: First remove leading blanks, just in case.    |
| Next obtain the name of the new function (NAME) and its    |
| argument list (ARGS), removing the latter.                 |
+------------------------------------------------------------+
DEXP      PROTO  POS(0)  SPAN(' ')  =
          PROTO  BREAK('(') . NAME  BAL . ARGS  =  NAME
+------------------------------------------------------------+
| Create code which will be the body of the new function.    |
| Then DEFINE it.                                            |
+------------------------------------------------------------+
          CODE(NAME ' ' PROTO ' :S(RETURN)F(FRETURN)')
          DEFINE(NAME ARGS)                       :(RETURN)
DEXP_END
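
The idea of building a function from a string at run time can
be illustrated outside SNOBOL4. The Python sketch below is an
analogy under stated assumptions: the name dexp, the regular
expression and the use of eval are choices made here, only a
single expression is accepted (DEXP also allows a sequence of
statements), and eval should be given trusted text only.

```python
import re

def dexp(proto, namespace):
    """Define namespace[NAME] from a prototype 'NAME(ARGS) = EXPRESSION'."""
    match = re.fullmatch(r"\s*(\w+)\s*\(([^)]*)\)\s*=\s*(.+)", proto)
    name, args, expression = match.groups()
    # Compile the text into a function body, then register it by name.
    namespace[name] = eval(f"lambda {args}: {expression}", namespace)
    return namespace[name]

functions = {}
dexp("AVE(X,Y) = (X + Y) / 2.0", functions)
print(functions["AVE"](3, 5))        # 4.0
```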


Epilogue 


Care must be taken in the use of DEXP. If the last statement
of a sequence fails, the entire function might inadvertently
fail. This can be cured by placing a semicolon after the last
statement (null statements always succeed). For example, we
can define SIGN(X) which returns +1 if X > 0 and -1 if X < 0
and null if X = 0 as:


          DEXP('SIGN(X) = GT(X,0) 1 ; SIGN = LT(X,0) -1 ;')


Program 14.2 - DEXTERN

One of the most frequent requests that SNOBOL4 users make is
for more space. If lack of main storage is due to the size of
the program, then this next function, or some variant of it,
can be used to obtain more core. The function DEXTERN (Define
EXTERNal function) will allow for the dynamic loading of
SNOBOL4-coded functions. The arguments to
DEXTERN(proto,label) are identical to those of the built-in
DEFINE function. DEXTERN will create a small provisional
function body for each such function. This will cause the
first call on that function to result in the function being
loaded from an external file, compiled and executed.
Subsequent calls go straight into execution with no overhead.


          DEFINE('DEXTERN(PROTO,LBL)NAME')
          DEFINE('LOADEX(LBL)PAT,X,CODE')
          LIB_  =  Some Library File Designator   :(DEXTERN_END)
+------------------------------------------------------------+
| Entry point for DEXTERN. Determine the label (LBL) and     |
| compile code which serves as the function body until the   |
| first call. Then define the function.                      |
+------------------------------------------------------------+
DEXTERN   PROTO  IDENT(LBL)  BREAK('(') . LBL
          CODE(LBL " LOADEX('" LBL "') ; :(" LBL ")")
          DEFINE(PROTO,LBL)                       :(RETURN)
+------------------------------------------------------------+
| Entry point for LOADEX(LBL). LOADEX will load an external  |
| segment of code beginning with label LBL and ending with   |
| LBL_END.                                                   |
+------------------------------------------------------------+
LOADEX    REWIND(LIB_)
          INPUT(.LIB_FILE,LIB_)
+------------------------------------------------------------+
| Loop to look for function                                  |
+------------------------------------------------------------+
          PAT  =  POS(0)  LBL  (' ' | RPOS(0))
LOADEX_1  CODE  =  LIB_FILE                       :F(ERROR)
          CODE  PAT                               :F(LOADEX_1)
+------------------------------------------------------------+
| Loop to process statements. Note conventional continuation |
| and comment characters.                                    |
+------------------------------------------------------------+
          PAT  =  POS(0)  LBL  '_END'  (' ' | RPOS(0))
LOADEX_2  X  =  LIB_FILE                          :F(LOADEX_3)
          X  PAT                                  :S(LOADEX_3)
          X  POS(0)  ANY('*-')                    :S(LOADEX_2)
          X  =  ';' X
          X  POS(0)  ';'  ANY('.+')  =  ' '
          CODE  =  CODE X                         :(LOADEX_2)
+------------------------------------------------------------+
| Now code it up and return.                                 |
+------------------------------------------------------------+
LOADEX_3  CODE(CODE)                              :(RETURN)
DEXTERN_END
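
The provisional-body technique is not peculiar to SNOBOL4. The
Python sketch below shows the same load-on-first-call idea
under several assumptions: a dictionary of source strings
stands in for the library file, the names dextern, library and
functions are chosen for the illustration, and exec plays the
part of CODE.

```python
def dextern(name, namespace, library):
    """Install a provisional body for name that loads the real one on demand."""
    def provisional(*args):
        exec(library[name], namespace)    # compile; this rebinds namespace[name]
        return namespace[name](*args)     # re-enter through the new definition
    namespace[name] = provisional

# A stand-in for the external library file (hypothetical contents).
library = {"double": "def double(x):\n    return 2 * x\n"}

functions = {}
dextern("double", functions, library)
was_provisional = functions["double"].__name__ == "provisional"
first = functions["double"](21)     # first call loads, compiles and executes
second = functions["double"](4)     # later calls reach the loaded body directly
print(first, second)                # 42 8
```

After the first call the provisional body is no longer
reachable by name, so no further loading overhead is incurred.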


Epilogue


One reason for the DEXTERN function is convenience. 
Frequently-used subroutines need not be copied into a given 
program but may be kept in a file which serves as a library. 
In this way several programs may share a common library and 
may be assured of up-to-date copies. 


Another reason for DEXTERN is that it permits the running of 
many large programs which would otherwise not fit into core. 
Most large programs have significant portions that are infre- 
quently used and it is extremely rare to encounter an applica- 
tion which requires all the facilities of the large program. 


The text processing system used to write this book is a good 
example of this. There are approximately 1200 statements in 
the main program and approximately 1500 in an external 
library. Each chapter of the book may be processed within 
prime-shift limits since no chapter uses all the facilities of 
the text processor. However, the entire book requires an 
evening run. 


It is not necessary to dynamically load source programs on a 
per-function basis. See Exercise 14.5. 


Program 14.3 - FTRACE

One advantage of decomposing a large program into functions is
that the values passed to a function and the value returned
can be easily monitored by means of the &FTRACE switch.
Unfortunately, only strings, reals and integers are printed
explicitly. Other data objects such as patterns, arrays,
tables, etc. result in only the datatype being printed (with
possibly an identification number as in SITBOL). This
deficiency can be corrected by the programmer, however, by
using the available trace facilities. In particular


          TRACE(NAME, 'CALL', , FNAME)


will cause the function named FNAME to be invoked when the
function named NAME is called. FNAME can determine sufficient
information about the called function (such as its arguments
via the ARGS function) to produce an elaborate display of any
aggregate passed as argument. The second argument to TRACE
can be the string 'RETURN' which can enable a similar function
to display the returned value.


One weakness of the scheme is that unlike the &FTRACE switch
which affects all function calls, the TRACE function requires
two explicit calls for each function traced. The FTRACE
function defined here is designed to automate this process. It
is simply placed once in the program before all functions
which are to be traced. FTRACE will redefine the DEFINE
function and thereby seize control at each function
definition. The functions actually called to do the tracing
(FTR_CALL and FTR_RET) are left as exercises.


          DEFINE('FTRACE(PROTO,LABEL)NAME')
          OPSYN('DEFINE.','DEFINE')
          OPSYN('DEFINE','FTRACE')
          &TRACE  =  10000                        :(FTRACE_END)
+------------------------------------------------------------+
| Entry point: Define the function, issue the trace requests |
| and return.                                                |
+------------------------------------------------------------+
FTRACE    DEFINE.(PROTO,LABEL)
          PROTO  BREAK('(') . NAME
          TRACE(NAME,'CALL',,'FTR_CALL')
          TRACE(NAME,'RETURN',,'FTR_RET')         :(RETURN)
FTRACE_END
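
The trick of renaming DEFINE and substituting a tracing
version can be mimicked in Python with a table of functions.
The sketch below is only an analogy: the table, the opsyn
helper and the log list are assumptions made for the
illustration, and the trace records are collected in a list
rather than printed.

```python
table = {}     # function name -> callable, standing in for the function table
log = []       # trace records are collected here

def opsyn(new, old):
    """Make the name new a synonym for whatever old currently denotes."""
    table[new] = table[old]

def plain_define(name, body):
    table[name] = body

def ftrace(name, body):
    """Define name as usual, but wrapped so that call and return are logged."""
    def traced(*args):
        log.append(("CALL", name, args))
        value = body(*args)
        log.append(("RETURN", name, value))
        return value
    table["DEFINE."](name, traced)     # the original definer, under its new name

table["DEFINE"] = plain_define
table["FTRACE"] = ftrace
opsyn("DEFINE.", "DEFINE")             # keep the original reachable
opsyn("DEFINE", "FTRACE")              # every later DEFINE is intercepted

table["DEFINE"]("AVE", lambda x, y: (x + y) / 2.0)
result = table["AVE"](3, 5)
print(result, log)
```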


Program 14.4 - INSULATE

This routine can protect other routines from possible
malfunction owing to an unanticipated modification of some
global variable or keyword. As written, protection from
modification of the &ANCHOR keyword is obtained, but this
protection could be extended to include other keywords and
global variables as well.


While it is held in these pages that modification of the 
&ANCHOR keyword is seldom warranted and is inconsistent with a 
general functional scheme of decomposing and structuring a 
large program, it is nonetheless true that occasionally one 
encounters two separately written sections of code that in- 
teract with each other and that depend on opposite values for 
the &ANCHOR keyword. For example, if routines in this book 
were called from a main program which assumed anchored mode, 
then pandemonium would be the general result.


To rectify the situation short of recoding one or the other of 
the two ill-fitting sections one may insert the INSULATE 
function. 


*-------------------------------------------------------------
*  INSULATE will cause each function following it to trap to
*  INS_CALL() when called and to INS_RET() on return. This
*  requires redefining DEFINE to point to INSULATE.
*-------------------------------------------------------------
          DEFINE('INSULATE(PROTO,LABEL)NAME')
          DEFINE('INS_CALL()')
          DEFINE('INS_RET()')
          OPSYN('DEFINE.', 'DEFINE')
          OPSYN('DEFINE', 'INSULATE')
          &TRACE = 100000                         :(INSULATE_END)
*-------------------------------------------------------------
*  Entry point for INSULATE. Define the function and set up
*  tracing.
*-------------------------------------------------------------
INSULATE  PROTO  BREAK('(') . NAME
          DEFINE.(PROTO, LABEL)
          TRACE(NAME, 'CALL', , 'INS_CALL')
          TRACE(NAME, 'RETURN', , 'INS_RET')      :(RETURN)
*-------------------------------------------------------------
*  The two routines.
*-------------------------------------------------------------
INS_CALL  PUSH(&ANCHOR) ; &ANCHOR = 0             :(RETURN)
INS_RET   &ANCHOR = POP()                         :(RETURN)
INSULATE_END


Names referenced    Name    Type        Where defined
by INSULATE:        PUSH    Function    Program 5.5
                    POP     Function    Program 5.6

Epilogue 


Note that when a routine is called and INS_CALL gains control
it calls the routine PUSH(). If tracing were on at this point,
PUSH would presumably be traced, sending control to INS_CALL
again; an infinite loop would be the sad result. But the
&TRACE switch is conveniently turned off at this point and
restored on return. As Dickman and Jensen (the original
implementors of the SNOBOL4 trace facility) put it, the 'stout
of heart' can turn tracing on after the function receives
control.
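
The save-and-restore discipline of INS_CALL and INS_RET can be
sketched in Python as a wrapper. The dictionary settings (a
stand-in for the &ANCHOR keyword), the list saved and the name
insulate are assumptions made for this illustration.

```python
settings = {"anchor": 1}      # stand-in for the &ANCHOR keyword
saved = []                    # stand-in for the PUSH/POP stack

def insulate(func):
    """Run func with anchor = 0, then restore the caller's value."""
    def wrapper(*args):
        saved.append(settings["anchor"])       # as in INS_CALL
        settings["anchor"] = 0
        try:
            return func(*args)
        finally:
            settings["anchor"] = saved.pop()   # as in INS_RET
    return wrapper

@insulate
def library_routine():
    # The routine sees unanchored mode whatever the caller chose.
    return settings["anchor"]

seen_inside = library_routine()
```

Using a stack rather than a single saved value keeps nested
calls of insulated routines correct.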


Program 14.5 - REDEFINE


SNOBOL4 has the ability to redefine built-in operators and
functions. Thus we may write


          OPSYN('+', '*', 2)


indicating that the binary operator '+' is made equivalent to
binary '*'. All additions thereafter become multiplications.
OPSYN can be used for named functions as well as operators and
user-defined functions as well as built-ins.


While the basic facility exists, we are here concerned with 
its proper and effective use as a programming tool. Undoub- 
tedly it has already occurred to the reader that he can play 
'fool the counselor' with an OPSYN as above. Let us assume,
however, that we are above such pranks. A semi-legitimate use 
of redefining an existing facility is as follows. Being un- 
familiar with the language, and in particular unaware of the 
built-in function REPLACE, a programmer writes a user-defined 
function REPLACE as part of a larger program. Subsequently he 
learns of this built-in facility and wants to use it. He may
write


          OPSYN('REP', 'REPLACE')


before defining REPLACE and use REP() to obtain the built-in
facility.


This use is only semi-legitimate for if the program is to have 
a long life, he would be better off redefining his original 
function, even if more painful, than in redefining a built-in. 


Redefining a built-in is normally only justifiable as a design 
objective if one is writing a facility designed to be upward 
compatible with an existing one. For example, one may redefine 
the operator '+' to sum arrays, complex numbers or physical
quantities but in that case it should treat conventional ob- 
jects (integers, reals, strings) as it did prior to the 
redefinition. 


REDEFINE(OP,PROTO,LABEL) is intended to make such upward
compatible extensions. The first argument is an operator to be
redefined or, if a function is redefined, the first argument
is null. The name of this function can be taken from the
second argument, which is the function prototype normally given
to DEFINE.


          DEFINE('REDEFINE(OP,DEF,LBL)NAME,N,FLAG')
+                                                 :(REDEFINE_END)
*-------------------------------------------------------------
*  Entry point: Extract the function's name (NAME) and deter-
*  mine the number of arguments (N = 1 or 2).
*-------------------------------------------------------------
REDEFINE  DEF  BREAK('(') . NAME '(' BREAK('),') LEN(1) . FLAG
          N = 1
          N = IDENT(FLAG, ',') 2
*-------------------------------------------------------------
*  But if the first argument is null, we are not talking
*  about an operator (OP) at all but a named function.
*-------------------------------------------------------------
          N = IDENT(OP)
          OP = IDENT(OP) NAME
          OPSYN(NAME '.', OP, N)
          DEFINE(DEF, LBL)
          OPSYN(OP, NAME, N)                      :(RETURN)
REDEFINE_END


Epilogue


In order to avoid defining away the built-in facility ir- 
retrievably, REDEFINE will OPSYN to it a created name formed 
by appending a period to the function's name. For example, 


          REDEFINE('+', 'SUM(X,Y)')


will cause SUM.() to be defined and equivalenced to the old 
binary + while binary + will now be equivalenced to SUM(). 
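
The bookkeeping performed by REDEFINE can be imitated in Python
with a table of operators. This is a sketch under assumptions:
the table ops, the function redefine and the list-summing
vec_sum are hypothetical names chosen for the illustration.

```python
import operator

ops = {"+": operator.add}          # stand-in for the operator table

def redefine(op, name, new_impl):
    """Keep the old meaning of op under name + '.', install new_impl."""
    ops[name + "."] = ops[op]      # like OPSYN(NAME '.', OP, N)
    ops[op] = new_impl             # like OPSYN(OP, NAME, N)

def vec_sum(x, y):
    # Upward compatible: lists are summed element by element,
    # everything else falls through to the old '+'.
    if isinstance(x, list) and isinstance(y, list):
        return [ops["+"](a, b) for a, b in zip(x, y)]
    return ops["SUM."](x, y)

redefine("+", "SUM", vec_sum)

plain = ops["+"](2, 3)
vector = ops["+"]([1, 2], [10, 20])
```

The old operation stays reachable under the dotted name, which
is what lets the new definition treat conventional objects as
before.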


REDEFINE can substantially simplify the task of extending a
range of built-in operators. This is best illustrated by
example as in the next program.


Program 14.6 - PHYSICAL


To illustrate the redefinition facility and to create a
possibly useful extension to SNOBOL4 we will define the four
fundamental operators of arithmetic to operate on 'physical'
quantities. For example, a quantity such as four meters
divided by a quantity such as two seconds produces a speed of
two meters-per-second. Normally, physical quantities are
represented by some combination of units of length, mass, time
and charge. We will illustrate our system with the
near-standard MKS system (Meters-Kilograms-Seconds-Coulombs)
but it should be obvious that any other system can be
employed. Indeed, the subroutines, as written, depend in no
way on our particular universe; any type and number of
physical quantities may be employed (up to the size of
&ALPHABET).


Physical quantities will be represented by a
programmer-defined datatype defined as


          DATA('PHYS(VAL,NUM,DEN)')


where VAL is the numerical value, NUM is the numerator of the
units field and DEN is the denominator. Units are represented
by single letters. For example, 3.5 meters/second^2 may be
represented as:


          PHYS(3.5, 'M', 'SS')


          DATA('PHYS(VAL,NUM,DEN)')
*-------------------------------------------------------------
*  The following operators and one function are redefined.
*-------------------------------------------------------------
          REDEFINE('-', 'MINUS(X)')
          REDEFINE('+', 'SUM(X,Y)')
          REDEFINE('-', 'DIFF(X,Y)')
          REDEFINE('*', 'MULT(X,Y)')
          REDEFINE('/', 'DIV(X,Y)')
          REDEFINE(   , 'EQ(X,Y)')
*-------------------------------------------------------------
*  NORM(X) will normalize a physical quantity, meaning that
*  we obtain a unique specification for comparison purposes.
*  This is done by sorting the physical units and canceling
*  common factors across the division bar.
*-------------------------------------------------------------
          DEFINE('NORM(X)C')                      :(NORM_END)
NORM      X = DIFFER(DATATYPE(X), 'PHYS') PHYS(X)
          NORM = X
          DEN(X) = ORDER(DEN(X))
          NUM(X) = ORDER(NUM(X))
NORM_1    IDENT(DEN(X))                           :S(RETURN)
          NUM(X)  ANY(DEN(X)) . C  =              :F(RETURN)
          DEN(X)  C  =                            :(NORM_1)
NORM_END
*-------------------------------------------------------------
*  XY() will normalize the two arguments of an arithmetic
*  operation (assumed to be X and Y). As an added bonus, XY()
*  will return success only if neither argument is a physical
*  quantity (in which case the old operation can be applied).
*-------------------------------------------------------------
          DEFINE('XY()')                          :(XY_END)
XY        (DIFFER(DATATYPE(X), 'PHYS')
+           DIFFER(DATATYPE(Y), 'PHYS'))          :S(RETURN)
          X = NORM(X) ; Y = NORM(Y)               :(FRETURN)
XY_END                                            :(PHYSICAL_END)
*-------------------------------------------------------------
*  The definitions of the separate functions are now greatly
*  simplified because of the utilities written above.
*-------------------------------------------------------------
MINUS     MINUS = XY() MINUS.(X)                  :S(RETURN)
          MINUS = PHYS(-VAL(X), NUM(X), DEN(X))   :(RETURN)
SUM       SUM = XY() SUM.(X,Y)                    :S(RETURN)
          SUM = PHYS(VAL(X) + VAL(Y), NUM(X), DEN(X))
+                                                 :(RETURN)
DIFF      DIFF = X + -Y                           :(RETURN)
MULT      MULT = XY() MULT.(X,Y)                  :S(RETURN)
          MULT = PHYS(VAL(X) * VAL(Y), NUM(X) NUM(Y),
+                  DEN(X) DEN(Y))                 :(RETURN)
DIV       DIV = XY() DIV.(X,Y)                    :S(RETURN)
          DIV = PHYS(VAL(X) / VAL(Y), NUM(X) DEN(Y),
+                  DEN(X) NUM(Y))                 :(RETURN)
EQ        XY()                                    :F(EQ_1)
          EQ.(X,Y)                                :S(RETURN)F(FRETURN)
EQ_1      (EQ(VAL(X), VAL(Y)) IDENT(NUM(X), NUM(Y))
+           IDENT(DEN(X), DEN(Y)))                :S(RETURN)F(FRETURN)
PHYSICAL_END

Names referenced    Name        Type        Where defined
by PHYSICAL:        REDEFINE*   Function    Program 14.5
                    ORDER       Function    Program 3.1

* indicates name is referenced in the initialization section.


Epilogue


As an example of the use of physical arithmetic, we may 
assign: 


          MET. = PHYS(1, 'M')
          SEC. = PHYS(1, 'S')
          KG.  = PHYS(1, 'K')


and from now on we need not so much as employ the PHYS()
functional form as it will be called implicitly. Thus a Newton
is a Kg.-Met./Sec.^2 so we write:


          NEWT. = (KG. * MET.) / (SEC. * SEC.)


and a Joule is a Newton-Meter:


          JL. = NEWT. * MET.


Though we are using an MKS system as a base for our physical 
quantities, we can specify any given problem and perform all 
calculations in thoroughly colloquial units. For example, we 
can express foot, mile and acre as: 


          IN.  = MET. / 39.4
          FT.  = 12 * IN.
          MI.  = 5280 * FT.
          ACRE = (MI. * MI.) / 640


We may then express computations entirely in the new units. 
For example, to print the acreage of a plot of ground 200' by
250' we write:


          OUTPUT = VAL(200 * FT. * 250 * FT. / ACRE) ' ACRES'


We may even dispense with the asterisk between 200 and FT. but 
this is left as an exercise. 
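
The acreage computation can be checked with a small Python
sketch of the same idea. It is an assumption-laden analogue,
not the author's program: the class Phys and its helper of are
hypothetical, normalization is done in the constructor rather
than by a separate NORM, and only multiplication and division
are provided.

```python
from collections import Counter

class Phys:
    """A value with unit letters in a numerator and a denominator."""
    def __init__(self, val, num="", den=""):
        n, d = Counter(num), Counter(den)
        common = n & d                  # cancel across the division bar
        self.val = val
        self.num = "".join(sorted((n - common).elements()))
        self.den = "".join(sorted((d - common).elements()))

    @staticmethod
    def of(x):
        # Plain numbers become dimensionless physical quantities.
        return x if isinstance(x, Phys) else Phys(x)

    def __mul__(self, other):
        o = Phys.of(other)
        return Phys(self.val * o.val, self.num + o.num, self.den + o.den)
    __rmul__ = __mul__

    def __truediv__(self, other):
        o = Phys.of(other)
        return Phys(self.val / o.val, self.num + o.den, self.den + o.num)

    def __rtruediv__(self, other):
        return Phys.of(other) / self

MET = Phys(1, "M")
IN = MET / 39.4
FT = 12 * IN
MI = 5280 * FT
ACRE = (MI * MI) / 640
plot = 200 * FT * 250 * FT / ACRE       # dimensionless number of acres
speed = Phys(4, "M") / Phys(2, "S")     # two meters-per-second
```

Since every unit cancels, plot carries empty unit fields and
its value is the plot's area in acres, a little under 1.15.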


Co-routines and state functions


The notion of co-routine is of interest from several
standpoints. In theoretical circles, it is as worshiped a
programming practice as the goto is deplored. However, this
theoretical enthusiasm does not carry over to the practical
world. Practical programmers shun co-routines to a greater
extent than they embrace goto's. Nonetheless, techniques for
the construction of well-formed programs are not very well
developed nor understood at this writing and study of the
co-routine protocol is warranted merely for the light it can
shed on this other, more general, issue.


As remarked by Knuth [Vol. 1, p. 191], small examples of
co-routines do not seem to exist and so we must construct a
somewhat elaborate situation merely to demonstrate what it is. 
The best example seems to be one furnished by a compiler. As 
we have discussed previously (Chapter 11), a compiler is fre- 
quently decomposed into lexical analysis and syntactic 
analysis. The purpose of lexical analysis is to decompose a 
string into a sequence of discrete non-decomposable objects
frequently represented by pointers into a symbol table. Thus, 
the portion of SNOBOL4 program: 


(ALPHA + BETA GAMMA) 


will be analyzed by the lexical analyzer into seven compo- 
nents, i.e., left parenthesis, ALPHA, binary plus, BETA, 
binary blank, GAMMA and right parenthesis. It may be seen from 
this example that the output of the lexical analyzer is not 
determined completely from the characters which appear before 
it on the input stream but is also based on characters which
have previously been processed. Thus, if the last token passed
back had been a binary operator, then a blank preceding an 
identifier (such as BETA) is ignored, but if the last token 
had been an identifier (or constant, right parenthesis, etc.) 
then the blank preceding another identifier is interpreted as 
an operator. 
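
The dependence on what came before can be made concrete with a
toy tokenizer in Python. It is a deliberately simplified sketch:
the names TOKEN and lex are hypothetical, only identifiers,
parentheses and blank-delimited binary operators are
recognized, and unary operators are ignored.

```python
import re

TOKEN = re.compile(
    r"(?P<op>\s+[-+*/]\s+)|(?P<blank>\s+)"
    r"|(?P<id>[A-Za-z]\w*)|(?P<paren>[()])"
)

def lex(text):
    """Return tokens; a blank is an operator only between operands."""
    after_operand = False      # the state of the lexical analyzer
    pending_blank = False
    out = []
    for m in TOKEN.finditer(text):
        kind, s = m.lastgroup, m.group()
        if kind == "blank":
            pending_blank = True
            continue
        starts_operand = kind == "id" or s == "("
        if pending_blank and after_operand and starts_operand:
            out.append("binary blank")
        pending_blank = False
        if kind == "op":
            out.append("binary " + s.strip())
            after_operand = False
        elif s == "(":
            out.append("left parenthesis")
            after_operand = False
        elif s == ")":
            out.append("right parenthesis")
            after_operand = True
        else:
            out.append(s)
            after_operand = True
    return out

result = lex("(ALPHA + BETA GAMMA)")
```

The flag after_operand is exactly the piece of history that a
function entered afresh on every call would not remember.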


The lexical analyzer can most naturally be described by state 
transitions. For example, after having processed a left paren- 
thesis, the lexical analyzer is in the same state as after it 
has processed a binary operator. Also, after having processed 
a right parenthesis it is in the same state it is in when it 
has processed an identifier. Though this simple example only 
depicts two such states there are in fact several others. 


States are most naturally represented by a location within the 
program which is currently being executed. Now this presents 
an anomaly if, as frequently happens, the syntactic analyzer 
calls the lexical analyzer for each token. This is because 
called functions do not normally 'remember' their state but
rather begin each computation afresh from some fixed entry 
point. 


We may at this point wonder if we had not got things backward. 
Maybe the lexical analyzer should call the syntactic analyzer 
each time it wants to dispose of one of its tokens. But then 
the shoe is on the other foot. The state of the syntactic 
analyzer is also best recorded by means of a location. 


This dilemma is resolved by a co-routine linkage. The jump- 
and-set-link instruction, common in most machines, can jump to 
a location and simultaneously set a register to the current 
location. By means of this instruction the lexical analyzer, 
when it wishes to return to the syntactic analyzer, can jump 
to a common return point which can save the contents of this 
register and use this as the start up point when the lexical 
analyzer is reentered. From the point of view of the lexical 
analyzer, it is like calling the syntactic analyzer. Actually, 
a little section of code is needed to make it seem as though 
each is calling the other in an entirely symmetric way. 


We may at this point step back and wonder why the need for co- 
routines is not felt more frequently than it is. Certainly it 
cannot be the inappropriateness of modeling computational 
behavior by state transitions as this is very common. The 
answer must lie in the fact that few functions require shifts 
in entry point to operate effectively. A shift in entry point 
implies that the next computation will depend on the ones 
which went before; that is, the function is non-homomorphic.* 


*Recall from Chapter 3 that a homomorphic string
transformation T is one such that T(S1 S2) = T(S1) T(S2).


Non-homomorphic transformations are frequently homomorphic if
the units are made large enough. Thus, lexical analysis, when
considered on a token basis, is non-homomorphic but is
homomorphic on a per-statement basis. This is, in fact, one
of the advantages of a string language (or a list language).
Entire sequences may be ported across functional boundaries
which may then be aligned with the natural decomposition of a
problem into homomorphic transformations.


Such decompositions alone, however, are not sufficient, neces- 
sarily, to reduce the complexity of large practical problems 
simply because the natural homomorphic transformation may be 
considerably complex (as is the case with a compiler). This, 
incidentally, is why simple co-routine examples don't exist. 
Simple examples tend to be homomorphic or at least expressible 
as simple homomorphic transformations. 


As stated above, the conventional co-routine protocol requires 
a jump-and-set-link instruction. No such facility exists in 
SNOBOL4 nor can one be programmed. The main reason for this
is that in order for a statement to be pointed to, it must 
have a label; the ‘pointer' is a string (identical to the 
label) and goto's are permitted by indirection (unary $). The 
&STNO and &LASTNO keywords provide statement numbers which
could be quite useful in this regard except for the fact that 
these numbers are entirely descriptive. No mechanism exists 
for going to a statement with some given number. 


In any event, it is not clear that a direct translation from 
assembly language is the form most useful to the SNOBOL4 
programmer. It is, in fact, more likely that we would want 
something closer to the normal function mechanism in which ar- 
guments are passed, values returned and temporaries saved. 
This is provided by the state function. 


Program 14.7 - STATEF


A state function is one whose next entry point (its state) is
determined by the return. In particular, in our rendition, if
the next entry point is to be label ENTRY_2, then the goto
should take the form


          :(RET('ENTRY_2'))


Returning from a state function is done only by calling
RET(label).


*-------------------------------------------------------------
*  A state function is defined by a call to STATEF. It must
*  not execute a RETURN but must pass control back via a call
*  to RET(NEXT) where NEXT is the next entry point.
*-------------------------------------------------------------
          DEFINE('STATEF(PROTO,LBL)NLBL')
          DEFINE('RET(NEXT)NAME')                 :(STATEF_END)
*-------------------------------------------------------------
*  Entry point for STATEF. Determine the nominal entry point
*  (LBL) for the state function. Then create a new label
*  (NLBL) which will serve as the real entry point for the
*  function.
*-------------------------------------------------------------
STATEF    PROTO  IDENT(LBL) BREAK('(') . LBL
          NLBL = LBL '_ENTRY'
          DEFINE(PROTO, NLBL)
*-------------------------------------------------------------
*  At this entry point we push our name so that upon return
*  we know what function we were in.
*-------------------------------------------------------------
          CODE(NLBL " PUSH('" NLBL "') :($" NLBL ")")
          $NLBL = LBL                             :(RETURN)
*-------------------------------------------------------------
*  Entry point for RET: Get the name pushed on entry. Assign
*  our argument (NEXT) to this name so that we know where to
*  come back to next time. Then indicate a return.
*-------------------------------------------------------------
RET       NAME = POP()
          $NAME = NEXT
          RET = .RETURN                           :(NRETURN)
STATEF_END

Names referenced    Name    Type        Where defined
by STATEF:          PUSH    Function    Program 5.5
                    POP     Function    Program 5.6


Epilogue


An example of the use of STATEF is given in Exercise 14.18. 
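
The protocol itself, in which each return names the entry
point for the next call, can be sketched in Python. The class
StateFunction and the two entry points below are hypothetical;
each entry returns its value together with the next label,
which plays the part of RET(NEXT).

```python
class StateFunction:
    """Call the current entry point; it names the next one."""
    def __init__(self, entries, start):
        self.entries = entries   # label -> function giving (value, next)
        self.state = start       # the nominal entry point

    def __call__(self, *args):
        value, self.state = self.entries[self.state](*args)
        return value

def first(x):
    return "first:" + x, "later"     # like :(RET('later'))

def later(x):
    return "later:" + x, "later"

f = StateFunction({"first": first, "later": later}, "first")
results = [f("a"), f("b"), f("c")]
```

The caller sees an ordinary function; only the function itself
knows that successive calls enter at different places.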


Program 14.8 - STACK


The functions PUSH, POP and TOP (Progs. 5.5, 5.6 and 5.7) are
fine if you only need one stack. What should one do if one
requires more than one stack? We could provide an optional
second argument to designate which of several stacks is
intended. For example, PUSH(V,N) could push an item V onto a
stack designated by N. The principal disadvantage of this
approach is that it produces code which lacks clarity. Another
disadvantage is that an extra instruction must be executed in
a rather simple function, resulting in inefficiencies. To
correct these deficiencies, we will incorporate the name of
the stack into the name of the function. For example, PUSHA(V)
will push onto stack A the value V. In general any string may
take the place of 'A' as a stack designator.


To automate the process of creating the stack functions, we
will write a function STACK(suffix). STACK will define three
stack-manipulation functions, POPsuffix, PUSHsuffix, and
TOPsuffix. For example, STACK('A') will define the three
functions, PUSHA(V), POPA() and TOPA().


          DEFINE('STACK(SUF)S')
          DATA('LINK(VALUE,NEXT)')                :(STACK_END)
*-------------------------------------------------------------
*  Entry point: Assign to S a long string equal to the code
*  we have to create except that the string 'SUF' is used
*  where the suffix will eventually be placed.
*-------------------------------------------------------------
STACK     S =
+           'PUSHSUF STACK_SUF = LINK(V,STACK_SUF) ;'
+           ' PUSHSUF = .VALUE(STACK_SUF) :(NRETURN) ;'
+           'POPSUF IDENT(STACK_SUF) :S(FRETURN) ;'
+           ' POPSUF = VALUE(STACK_SUF) ;'
+           ' STACK_SUF = NEXT(STACK_SUF) :(RETURN) ;'
+           'TOPSUF IDENT(STACK_SUF) :S(FRETURN) ;'
+           ' TOPSUF = .VALUE(STACK_SUF) :(NRETURN) ;'
*-------------------------------------------------------------
*  Now we create the required code and define functions.
*-------------------------------------------------------------
          CODE(REPL(S, 'SUF', SUF))
          DEFINE('PUSH' SUF '(V)')
          DEFINE('POP' SUF '()')
          DEFINE('TOP' SUF '()')                  :(RETURN)
STACK_END

Names referenced    Name    Type        Where defined
by STACK:           REPL    Function    Program 3.15


Epilogue


Note the use of the REPL function to create code. It is 
possible to avoid the use of REPL by a judicious concatenation 
of string constants and variables (try it) but it is im- 
possible to avoid going mad in the process. 
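
In a language with closures the same family of functions can
be produced without generating source text. The following
Python sketch assumes a hypothetical table api in place of
SNOBOL4's function definitions, and raises IndexError where the
SNOBOL4 functions would fail.

```python
api = {}      # hypothetical table of generated functions

def stack(suffix):
    """Define PUSH<suffix>, POP<suffix> and TOP<suffix>."""
    items = []                        # private to this suffix

    def push(v):
        items.append(v)
        return v

    def pop():
        if not items:                 # where POPsuffix would fail
            raise IndexError("stack " + suffix + " is empty")
        return items.pop()

    def top():
        if not items:                 # where TOPsuffix would fail
            raise IndexError("stack " + suffix + " is empty")
        return items[-1]

    api["PUSH" + suffix] = push
    api["POP" + suffix] = pop
    api["TOP" + suffix] = top

stack("A")
stack("B")
api["PUSHA"](1)
api["PUSHB"](2)
api["PUSHA"](3)
top_a = api["TOPA"]()
popped = [api["POPA"](), api["POPA"](), api["POPB"]()]
```

Each call of stack creates an independent list, so the two
stacks above never interfere.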


Exercise 14.1

If we attempted to define MAX(X,Y) by means of:


          DEXP('MAX(X,Y) = X ; MAX = GT(Y,X) Y')


we would experience a difficulty. (a) What is it? (b) What
simple change in this call will correct things?


Exercise 14.2

Modify DEXP (Prog. 14.1) so that identifiers following the
argument list are regarded as function temporaries (requires
modifying one statement).


Exercise 14.3

The encoding of LOADEX (in Prog. 14.2) assumes no syntax error
in the external code. Modify LOADEX so that if the external
code contains a syntax error it will print out the code and
establish a function body which will always fail.


Exercise 14.4

Rewrite DEXTERN so that it operates by tracing. That is, on
first call of the indicated function, a routine is called
which loads the function (you may use LOADEX to simplify
matters). Be sure to issue a STOPTR after loading the
function.


Exercise 14.5

A particularly long program consists of sections labeled L1,
L2, ..., L100. Not all of these sections are in use in any
given run. But, depending on the data, any section could be
reached. Using LOADEX, how could you replace these sections
with something smaller?


Exercise 14.6

Encode FTR_CALL and FTR_RET to trace functions as required by
FTRACE (Prog. 14.3).


Exercise 14.7

Should the definition of FTR_CALL and FTR_RET precede or
follow the definition of FTRACE or does it not make any
difference?


Exercise 14.8

Modify INSULATE (Prog. 14.4) so that it doesn't depend on
TRACE to obtain control on calls or returns.


Exercise 14.9

How could INSULATE be used to guard against modifications of
the ARB variable?


Exercise 14.10

Define a complex number by the structure


          DATA('COMPLEX(R,I)')


where R is the real part and I is the imaginary part. With
the help of REDEFINE (Prog. 14.5) extend the binary operators
+, -, *, / and the binary functions GT, GE, LE, LT, EQ, NE to
operate on complex numbers if one or both of the arguments are
complex. To simplify things, write a generalized argument
processing function which will succeed if both arguments are
not complex and will otherwise fail, converting any
non-complex argument to complex.


Exercise 14.11

Assuming that the binary arithmetic operators have been
redefined to operate on COMPLEX quantities as in the previous
exercise, can the PHYSICAL package also be used with the VAL
field a possibly complex quantity? Said another way, what
trouble spots are there in compounding redefinitions along the
lines suggested?


Exercise 14.12

Redefine the arithmetic operators to operate on
identically-dimensioned arrays.


Exercise 14.13

Ordinarily a function such as F() cannot set the variable F as
a side effect since the value of F is saved at the call and
restored on return. Strange as it seems, however, a technique
exists to do precisely that. In particular, it is possible
that F(X) will assign the value of X to the variable F. Define
such an F.


Exercise 14.14

Generalize the previous exercise. That is, define a function
DEF(NAME) such that, for example, DEF('F') will establish F(X)
as equivalent to:


          F = X


Exercise 14.15

Rewrite STATEF (Prog. 14.7) such that on a return via the call
RET(LABEL) the function DEFINE is called with LABEL the new
entry point.


Exercise 14.16

In the epilogue to PHYSICAL (Prog. 14.6) we expressed the
quantity 200 FT. with an intervening asterisk (denoting
multiplication). This could have been avoided by redefining
concatenation (a purifying experience). What four statements
need be added to PHYSICAL so that concatenation as well as
multiplication form the product of physical units? (Hints: Be
cautious of a circular definition, i.e. using concatenation to
define concatenation, unless the recursion stops. Don't worry
about the various predicate uses of concatenation since your
program won't get control if one of the items to be
concatenated fails.)


Exercise 14.17

Add an FRET(NEXT) function to provide an FRETURN facility to
STATEF (Prog. 14.7).


Exercise 14.18

Draw a state transition table for a lexical analysis of
SNOBOL4 expressions (i.e., assume no labels, no pattern
matching, no goto-fields, just expressions) as follows. For
each state and each token (left parenthesis, identifier,
number, operator, etc.) direct an arrow to the next state and
indicate what, if anything, is to be returned. Implement this
as a state function.


Exercise 14.19

Write a function FUNCTION(NAME) that will succeed returning
the null string if NAME is the name of a programmer-defined
function. Otherwise it should fail. Hint: the definition of
FUNCTION should appear before every other function. For extra
credit, any name OPSYN'ed to some other name should also be
regarded as a programmer-defined function.


Even special-purpose programming languages require
arithmetic. The original SNOBOL contained the five
arithmetic operators (+, -, /, *, **) which operated
only on strings (that resembled integers) within a
limited form of expression (e.g., no parentheses).
SNOBOL3 allowed more freedom (e.g., parenthetical groupings
were permitted) in forming expressions but retained the string
format for representing integers. SNOBOL4 broke with the
tradition of the single datatype and introduced both INTEGER
and REAL as separate types. Moreover, it represented these
objects internally as machine integers and reals (i.e.,
floating point numbers) respectively. Hence, a study of
SNOBOL4 numbers, in contrast to previous SNOBOL's, is very
much a study of how they are represented on most machines.


Most machines for which SNOBOL4 has been implemented are 
binary machines representing integers in base-two notation. 
In every case known to the author, the negatives are represen- 
ted in two's complement form. This is the binary equivalent 
of representing, say, -2 by a number of the form 999...99998. 
Hence, the range of integers is usually 


          [-2^{W-1},\ 2^{W-1} - 1]                      (15.1)


where W is the number of bits in the field allowed for in- 
tegers. Usually, W is the word size of the machine. For 
example, on the IBM 360/370 implementation of both SNOBOL4 and 
SPITBOL, the range of integers is [-2^{31}, 2^{31} - 1].
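
The range formula and the two's complement pattern are easy to
check numerically. The Python helpers below (int_range and
twos_complement are names assumed for this sketch) compute the
bounds for a W-bit field and the bit pattern of -2, the binary
counterpart of 999...99998.

```python
def int_range(w):
    """Smallest and largest integers in a w-bit two's complement field."""
    return -(2 ** (w - 1)), 2 ** (w - 1) - 1

def twos_complement(value, w):
    """Bit pattern of value in a w-bit field, as a string of 0s and 1s."""
    return format(value % (2 ** w), "0{}b".format(w))

low, high = int_range(32)            # the 360/370 word size
pattern = twos_complement(-2, 8)     # an 8-bit field for brevity
```

For W = 32 the bounds are -2147483648 and 2147483647, and in
eight bits -2 appears as 11111110.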


The first several programs offer some examples of integer
manipulation, the last of which (INFINIP) is aimed at
overcoming the restrictions imposed by a finite word size.


Program 15.1 - COMB


The function COMB(N,M) will return the number of combinations
of N things taken M at a time, usually written in 'over'
notation as shown and defined below:


     COMB(N,M) = \binom{N}{M} = \frac{N!}{(N-M)!\,M!}     (15.2)


where N >= M >= 0. By convention 0! = 1. For N < M the value
of COMB, by convention, is 0. COMB(N,M) may also be regarded
as the coefficient of X ** M in the expansion of (X + Y) ** N
and is therefore called the binomial coefficient. It is
illustrated by the easily remembered Pascal's triangle:


                      1
                    1   1
                  1   2   1
                1   3   3   1
              1   4   6   4   1


in which N corresponds to the row (starting with 0) and M
corresponds to the position within the row (starting with 0).
Note that each term may be found by adding the two elements
immediately above it. Hence we have a simple recursive method
for computing COMB(N,M). A slightly more efficient method is
used below which is based on the identity:


     \binom{N}{M} = \frac{N}{M} \binom{N-1}{M-1}          (15.3)


provided M > 0.


*-------------------------------------------------------------
*  COMB(N,M) returns the number of combinations of N things
*  taken M at a time.
*-------------------------------------------------------------
          DEFINE('COMB(N,M)')                     :(COMB_END)
COMB      COMB = EQ(M,0) 1                        :S(RETURN)
          COMB = COMB(N - 1, M - 1) * N / M       :(RETURN)
COMB_END
Epilogue 


Note that we do not write COMB in terms of factorials as this 
may needlessly result in integer overflow during the calcula- 
tion of intermediate results. An alternative approach is to 
write COMB iteratively and is to be recommended if time is an 
issue. This is left as Exercise 15.1. A rather bizarre method 
for computing COMB relies on pattern matching. This too is 
left as an exercise. 
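
As a cross-check, the recursion used by COMB can be written in
Python and compared with the library routine math.comb. The
function name comb is an assumption of this sketch; the integer
division is exact because the product is always a multiple of
M.

```python
from math import comb as reference_comb

def comb(n, m):
    """Binomial coefficient via COMB(N,M) = COMB(N-1,M-1) * N / M."""
    if m == 0:
        return 1
    return comb(n - 1, m - 1) * n // m

row_4 = [comb(4, m) for m in range(5)]    # a row of Pascal's triangle
too_many = comb(3, 5)                     # N < M
```

The multiplication is performed before the division, as in
the SNOBOL4 statement, so no intermediate quotient is
truncated.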


Program 15.2 - DECOMB


We have seen several methods of representing numbers, the
Roman system, the positional number systems (BASEB and BASE10,
Progs. 2.4 and 2.5) and the factorial number system
(PERMUTATION, Prog. 12.1 and its prologue). The combinatorial
number system is yet another number system where a sequence of
integers can be used to represent a presumably larger integer.
Given a fixed number n called the nome, one can represent any
positive integer K by a vector K_n, ..., K_2, K_1 such that


     K = \binom{K_n}{n} + \cdots + \binom{K_2}{2} + \binom{K_1}{1}
                                                        (15.4)


Moreover, if we add the restriction that:


     K_n > \cdots > K_2 > K_1 \ge 0                     (15.5)


the representation is unique. The values K_n, ..., K_2, K_1
are called cogets (as opposed to digits). The combinatorial
number system can be used to find a uniformly distributed
evaluation of poker hands (POKEV, Prog. 17.6) and this relies
mainly on the fact that cogets are monotonically decreasing.


To see that the representation is unique (for a fixed nome)
note that if the cogets assume their least value (K_1 = 0,
K_2 = 1, ..., K_n = n-1) we obtain K = 0. Next, we assert that
if the cogets assume their largest value with K_n = M, then K
will be incremented by exactly one if K_n is increased by one
(to M+1) and all other cogets are made as low as possible.
That is:


     \binom{M}{n} + \binom{M-1}{n-1} + \cdots + \binom{M-n+1}{1} + 1
          = \binom{M+1}{n}


That this is true follows from the rule of forming Pascal's
triangle, viz.


     \binom{N}{M} = \binom{N-1}{M} + \binom{N-1}{M-1}     (15.6)


The second of the two terms on the right is decomposed
according to this formula and this is continued until the '1'
is reached.


Finally note that increasing K_1 by 1 increases K by 1. From
these three observations, it follows that all integers are
representable and that their representation is unique.


DECOMB(S) will regard S as a sequence of cogets, i.e. a number 
in the combinatorial number system, and will return its cor- 
responding integer value. Cogets are represented as characters 
from an alphabet (COMB_ALPHA) much as we have previously done 
with positional representations. 
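
For readers who wish to check the arithmetic outside SNOBOL4, the following is a minimal sketch in Python of the evaluation just described, together with a greedy inverse. The names decomb and encomb are ours and purely illustrative; the inverse is not part of the program below and is offered only as an assumption-laden companion (it presumes the value fits within the 26-character alphabet).

```python
from math import comb

COMB_ALPHA = "0123456789ABCDEFGHIJKLMNOP"

def decomb(s):
    """Integer value of the coget string s (most significant coget first)."""
    n = len(s)
    total = 0
    for i, ch in enumerate(s):
        k = COMB_ALPHA.index(ch)       # coget value; ValueError if not in the alphabet
        total += comb(k, n - i)        # the coget in position r contributes C(k, r)
    return total

def encomb(value, nome):
    """Hypothetical inverse: pick the largest admissible coget at each position."""
    cogets = []
    for r in range(nome, 0, -1):
        k = r - 1                      # least coget for position r, since C(r-1, r) = 0
        while comb(k + 1, r) <= value:
            k += 1
        value -= comb(k, r)
        cogets.append(COMB_ALPHA[k])   # IndexError if the coget exceeds the alphabet
    return "".join(cogets)
```

With a nome of 3, the least coget string "210" evaluates to 0 and "310" to 1, in agreement with the uniqueness argument above.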


*  DECOMB(S) returns the decimal number equivalent of the
*  argument S regarded as a representation in the combinatorial
*  number system.
*
         DEFINE('DECOMB(S)T')
         COMB_ALPHA = '0123456789ABCDEFGHIJKLMNOP'      :(DECOMB_END)
DECOMB   S  LEN(1) . T  =                               :F(RETURN)
         COMB_ALPHA  @K  T                              :F(FRETURN)
         DECOMB = DECOMB + COMB(K,SIZE(S) + 1)          :(DECOMB)
DECOMB_END


Names referenced    Name    Type        Where defined
by DECOMB:          COMB    Function    Program 15.1

Epilogue


For additional information concerning the combinatorial number
system see Lehmer [1964] or Whitehead [1973].


Program 15.3  INFINIP

INFINIP is a package of infinite precision arithmetic (i.e.
integer) functions. Large integers are represented by strings
of digits and so the size of integers permitted
is not quite infinite but is limited by the maximum length of 
strings. This is generally quite large so that for all intents 
and purposes the precision may be regarded as infinite. 


INFINIP redefines virtually all arithmetic operators to handle 
large integers in an upward compatible way. This facilitates 
their use, and makes them plug-in-able to routines that have 
already been written using conventional facilities. It also
serves to make the algorithms themselves clearer, since they 
are written, in part, recursively. 


INFINIP has applications in addition to generating numerical 
wall-paper. For example, it can alleviate some rather severe 
restrictions encountered in base conversions (BASEB and 
BASE10, Progs. 2.4 and 2.5) and permutation generation 
(PERMUTATION, Prog. 12.1). 


Our basic operating philosophy in writing INFINIP was not 
speed. A linked-list approach would probably have been 
considerably faster. Our main goal was to produce a legible
and flexible package that could serve (a) to produce the ef- 
fect and (b) as a kind of extended precision laboratory in 
which different algorithms could be tested. Techniques used 
to implement infinite-precision arithmetic can also be found 
in Knuth [Vol. 2], Blum [1965], and Collins [1966]. 


*  INFINIP - an infinite (just about) precision arithmetic
*  package. The following operators and built-in functions
*  are redefined.
*
         REDEFINE('-','MINUS(X)Y')
         REDEFINE('GT','GT(X,Y)')
         REDEFINE('EQ','EQ(X,Y)')
         REDEFINE('GE','GE(X,Y)')
         REDEFINE('NE','NE(X,Y)')
         REDEFINE('LT','LT(X,Y)')
         REDEFINE('LE','LE(X,Y)')
         REDEFINE('-','DIFF(X,Y)')
         REDEFINE('+','SUM(X,Y)X1,X2,Y1,Y2,K')
         REDEFINE('*','MULT(X,Y)X1,X2,K')
         REDEFINE('/','DIV(X,Y)X1,X2,Y1,Y2,T,T1,T2,KX,KY')
         REDEFINE('REMDR','REMDR(X,Y)')
*
*  Pattern definitions:
*
         SIGN_OFF = POS(0) '-'
         LDG_ZEROS = BREAK('123456789') | RTAB(1)
         NO_DIGITS = 8
*
*  Utility functions
*
         DEFINE('SMALL()')
         DEFINE('SPLIT(NAME,PAT)')                      :(INFINIP_END)


*  SMALL() will succeed if X and Y are small integers, defined
*  strategically as integers whose sum or difference will not
*  cause overflow. Tactically, they are defined as numbers
*  whose digit count does not exceed NO_DIGITS.
*
SMALL    (LE.(SIZE(X),NO_DIGITS)
+         LE.(SIZE(Y),NO_DIGITS))                       :S(RETURN)F(FRETURN)
*
*  SPLIT(NAME,PAT) will split the named string into two
*  parts, NAME1 and NAME2 (after removing leading zeros). It
*  returns the amount of the split measured from the right.
*  The split is determined by the incoming pattern (PAT); if
*  this is null the split is approximately half.
*
SPLIT    PAT = IDENT(PAT) LEN(SIZE($NAME) / 2)
         $NAME  (PAT | '') . $(NAME 1)  @SPLIT  (SPAN('0') | '')
+               REM . $(NAME 2)
         SPLIT = SIZE($NAME) - SPLIT                    :(RETURN)


*  Unary minus - Remember, REDEFINE establishes MINUS. as the
*  old MINUS built-in.
*
MINUS    MINUS = SMALL() MINUS.(X)                      :S(RETURN)
         MINUS = X
         MINUS  SIGN_OFF =                              :S(RETURN)
         MINUS = '-' X                                  :(RETURN)
*
*  The predicates - They assume integers in normal form (i.e.
*  no leading zeros).
*
GT       SMALL()                                        :F(GT_1)
         GT.(X,Y)                                       :S(RETURN)F(FRETURN)
GT_1     X  SIGN_OFF =                                  :F(GT_2)
         Y  SIGN_OFF =                                  :F(FRETURN)
         SWAP(.X,.Y)
GT_2     Y  SIGN_OFF =                                  :S(RETURN)
         LGT(LPAD(X,SIZE(Y),'0'),
+            LPAD(Y,SIZE(X),'0'))                       :S(RETURN)F(FRETURN)
EQ       SMALL()                                        :F(EQ_1)
         EQ.(X,Y)                                       :S(RETURN)F(FRETURN)
EQ_1     IDENT(X,Y)                                     :S(RETURN)F(FRETURN)
GE       ~(~GT(X,Y) ~EQ(X,Y))                           :S(RETURN)F(FRETURN)
NE       EQ(X,Y)                                        :S(FRETURN)F(RETURN)
LT       GE(X,Y)                                        :S(FRETURN)F(RETURN)
LE       GT(X,Y)                                        :S(FRETURN)F(RETURN)
*
*  DIFF(X,Y) - Let SUM(X,Y) handle it.
*
DIFF     DIFF = X + -Y                                  :(RETURN)
*  SUM(X,Y) - There are essentially two cases: plus plus
*  and plus minus. We first reduce to cases.
*
SUM      SUM = SMALL() SUM.(X,Y)                        :S(RETURN)
         SUM = LT(X,0) -(-X + -Y)                       :S(RETURN)
         Y  SIGN_OFF =                                  :S(SUM_1)
*
*  Here is plus plus. Simply divide and conquer.
*
         (LT(X,Y) SWAP(.X,.Y))
         K = SPLIT(.X)
         Y = Y + X2
         SPLIT(.Y,RTAB(K))
         SUM = (Y1 + X1) LPAD(Y2,K,'0')                 :(RETURN)
*
*  Here is plus minus. Make sure X >= Y. Then add the 10's
*  complement of Y.
*
SUM_1    SUM = GT(Y,X) -(Y - X)                         :S(RETURN)
         Y = LPAD(Y,SIZE(X),'0')
         SUM = X + 1 + REPLACE(Y,'0123456789','9876543210')
         SUM  '1' LDG_ZEROS REM . SUM                   :(RETURN)
*  MULT(X,Y) - Multiply is fairly simply written especially
*  if we concentrate on reducing the size of one argument at
*  a time. Note that the test for small size is somewhat
*  different here.
*
MULT     MULT = LE(SIZE(X) + SIZE(Y),NO_DIGITS)
+               MULT.(X,Y)                              :S(RETURN)
         MULT = LT(X,0) -X * -Y                         :S(RETURN)
         MULT = LT(Y,0) -(X * -Y)                       :S(RETURN)
         (GT(Y,X) SWAP(.X,.Y))
         MULT = EQ(Y,0) 0
         K = SPLIT(.X)
         MULT = (Y * X1) DUPL('0',K)
         MULT = MULT + X2 * Y                           :(RETURN)


*  DIV(X,Y) - First we handle negative arguments much as we
*  did with multiply. The next part, more than any other
*  section, requires some explanation. Imagine a long division
*  operation with two (rather large) digits Y1, Y2 being
*  divided into two other large digits X1, X2. The trial
*  quotient T1 (on top of the line) is multiplied by the
*  divisor Y and subtracted from the left end of X to produce
*  error term T. This term is then divided by Y to obtain a
*  final adjustment.
*
DIV      DIV = SMALL() DIV.(X,Y)                        :S(RETURN)
         DIV = LT(X,0) -(-X / Y)                        :S(RETURN)
         DIV = LT(Y,0) -(X / -Y)                        :S(RETURN)
         DIV = GT(Y,X) 0                                :S(RETURN)
         KY = SPLIT(.Y,LEN(NO_DIGITS / 2) | REM)
         KX = SPLIT(.X,LEN(NO_DIGITS))
         T1 = X1 / Y1
         T2 = DUPL('0',KX - KY)
         T = X - ((T1 * Y) T2)
         DIV = T1 T2
         T = LT(T,0) T + 1 - Y
         DIV = DIV + (T / Y)                            :(RETURN)
*
*  And last but not least, REMDR.
*
REMDR    REMDR = X - (X / Y) * Y                        :(RETURN)
INFINIP_END


Names_referenced Name Type Where defined 
by_INFINIP: REDEFINE Function Program 14.5 
SWAP Function Program 3.14 
LPAD Function Program 3.2 
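
The plus-minus case of SUM may be easier to follow in isolation. The sketch below, in Python, mimics the 10's complement step on digit strings; the function names are ours, and Python's built-in integers stand in for the recursive string addition that the package performs itself. It assumes both arguments are non-negative digit strings with X not less than Y.

```python
def nines_complement(digits):
    """Replace every digit d by 9 - d, as REPLACE does in SUM_1."""
    return digits.translate(str.maketrans("0123456789", "9876543210"))

def subtract(x, y):
    """x - y for non-negative decimal digit strings with x >= y."""
    y = y.rjust(len(x), "0")                     # LPAD(Y,SIZE(X),'0')
    # x + 1 + (10**n - 1 - y) = 10**n + (x - y), where n = len(x)
    total = str(int(x) + 1 + int(nines_complement(y)))
    rest = total[1:].lstrip("0")                 # drop the leading '1', then leading zeros
    return rest or "0"                           # keep one digit when the difference is zero
```

For example, subtract("1000", "1") forms 1000 + 1 + 9998 = 10999, discards the leading 1 and the leading zero, and returns "999".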

REALS and Mixed Mode

REALS consist of three fields, a sign bit, the exponent (or
characteristic) and the mantissa. The exponent indicates the
power to which an assumed base must be raised whereas the
mantissa represents the most significant bits of the number.
In symbols:

     NUMBER = mantissa * base ** exponent


REALS, of course, vastly increase the range of numbers
representable at the sacrifice of precision. While the
particular details of representing floating point numbers differ
from machine to machine, there are nonetheless a few general
practices which most machine manufacturers adhere to:


The three fields of a floating point number are arranged in 
their order of significance and adjusted so that comparison of 
two quantities can be made using the same arithmetic comparator
as integers. This places the sign bit in the first
position, followed by the exponent and then the mantissa. To 
facilitate comparisons, the exponent is represented in so- 
called excess notation with the most negative exponent 
represented as 00...0 and the highest as 11...1. Also, the 
mantissa is normalized to produce, for any given number, a 
unique exponent, again, so that the comparison can be carried 
out. The mantissa is normalized by shifting it to the left 
and decreasing the exponent until further shifting destroys 
information. The mantissa is generally assumed to represent a 
fraction just less than 1. With a binary base, the lead digit 
of the normalized number is always 1 and so represents redun- 
dant information. It can, and actually has been, omitted on 
at least one machine (the PDP-11). By convention, a floating 
point 0 is represented as an all-0 word. On the PDP-11 it is 
the only bit pattern not otherwise used.


The IBM 360 uses a base of 16 and hence the normalization 
process may not produce, in the mantissa, a leading bit of 1. 
Rather, the leading four bits must contain a 1. For this 
reason, numbers whose leading hexadecimal digit is low (such 
as 1 or 2) cannot be represented very accurately (the error as 
a fraction of the number is relatively large) and hence the 
need exists on the 360, more than on most other machines, for
double and quadruple precision. 


We will speak (loosely) of the range of REAL numbers and by 
this we will mean roughly the extremes of values the REALS can 
achieve. These can be very high, very low or very negative 
and are governed almost solely by the base and the maximum ex- 
ponent. We will speak of the precision P as meaning the binary 
precision given generally as: 


where M is the size of the mantissa in bits (including in- 
visible bits) and B is the base of the exponent. Approx- 
imately, the precision is the negative log (to the base 2) of 
the relative error of a number due to the finite resolution of 
the representation. 


It should be noted that integers up to 2**M, or so, can be 
represented exactly as REALs and that operations such as plus, 
minus and multiply are exact provided no intermediate results 
exceed this limit. 
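
The limit is easy to observe on a present-day machine. The following sketch, in Python, assumes that Python floats are binary floating point numbers rounded to nearest (true of IEEE 754 doubles, where M is 53); the variable names are ours.

```python
import sys

M = sys.float_info.mant_dig          # mantissa size in bits, hidden bit included
limit = 2 ** M

# Just below the limit, adding 1 is still exact.
exact_below = float(limit - 1) + 1.0 == float(limit)

# At the limit, the added 1 is lost: limit + 1 has no exact representation.
absorbed = float(limit) + 1.0 == float(limit)
```

Both comparisons yield True under the stated assumptions: integers up to 2**M survive intact, and the first casualty is 2**M + 1.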


The rules governing mixed expressions in SNOBOL4 are similar 
to those in Fortran. If the two operands of a binary arith- 
metic operator (other than **) or a binary comparator (GE, EQ,
etc.) have different types (one INTEGER and the other REAL)
then the integer is converted to REAL before the operation 
proceeds. SPITBOL contains a DREAL type (double precision) 
and if one of the arguments to such an operation is DREAL then 
the other is converted if necessary to DREAL. 


One important difference with Fortran (or PL/I for that mat- 
ter) is that the types are not declared but are contained as 
part of the value. This means that it is possible to write a 
routine which can accept either type as argument and return a 
correct result. For example, assuming we wish to write a 
routine RECIP(X) which will return the reciprocal of the num- 
ber X, we can simply write: 


RECIP    RECIP = 1.0 / X                                :(RETURN)


This routine will operate correctly whether the argument is
INTEGER, REAL, or DREAL.


Programs 15.4 & 15.5  FLOOR & CEIL

FLOOR(X) is defined as the largest integer not greater than X.
CEIL(X) is the smallest integer not less than X. They are both
related (nonlinearly) to the integer conversion facility which
truncates toward zero.


         DEXP('CEIL(X) = -FLOOR(-X)')
         DEFINE('FLOOR(X)')                             :(FLOOR_END)
FLOOR    FLOOR = CONVERT(X,'INTEGER')
         GE(X,0)                                        :S(RETURN)
         FLOOR = NE(X,FLOOR) FLOOR - 1                  :(RETURN)
FLOOR_END

Names referenced    Name    Type        Where defined
by FLOORCEIL:       DEXP    Function    Program 14.1

Epilogue


FLOOR and CEIL, in addition to illustrating how
CONVERT(,'INTEGER') behaves, are of interest in their own
right. Below, let N be an integer and let X be a real. Then:

\[
\begin{aligned}
N \ge \mathrm{CEIL}(X)  &\iff N \ge X \\
N <   \mathrm{CEIL}(X)  &\iff N <   X \\
N \le \mathrm{FLOOR}(X) &\iff N \le X \\
N >   \mathrm{FLOOR}(X) &\iff N >   X
\end{aligned}
\qquad (15.7)
\]


These identities can be used to solve some interesting integer 
inequalities in a straightforward fashion. (See Exercise 
15.9.) 
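
As a cross-check, here is a minimal Python sketch of the same construction, building FLOOR and CEIL from a conversion that truncates toward zero (Python's int() behaves this way on floats). The function names are ours.

```python
def floor(x):
    """Largest integer not greater than x, built from truncation toward zero."""
    f = int(x)                 # truncates toward zero, like CONVERT(X,'INTEGER')
    if x >= 0:
        return f
    return f - 1 if x != f else f

def ceil(x):
    """Smallest integer not less than x, by the identity CEIL(X) = -FLOOR(-X)."""
    return -floor(-x)
```

Thus floor(-2.7) is -3 while int(-2.7) is -2, which is the nonlinearity referred to above; for integer N, N >= ceil(X) holds exactly when N >= X.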

Transcendental Functions

A transcendental function is one that cannot be written
(finitely) using the four fundamental operations of addition,
subtraction, multiplication and division. Examples include the
sine and other trigonometric functions, logarithms, etc. These may
be represented as an infinite series (power series, Taylor 
series) of terms involving X**n where n = 0, 1, 2, ... and X
is the argument. This represents a readily available
computational method which is often the best technique if the
precision of the machine is unknown; i.e. if the computation 
is to be machine-independent or if it is to be equally valid 
for single and double precision. 


Where the precision is known, a much more efficient technique 
is the so-called Chebyshev interpolation method. Since most 
libraries are written for a specific machine, this method is 
widely used and a little knowledge is helpful if only for the 
purpose of pirating existing code. Let us assume that we wish 
to approximate the function f(x) with an nth degree polynomial 
p(x) and, moreover, suppose that we wish p(x) to be the best 
such approximation in the so-called mini-max sense. That is, 
the maximum deviation from f(x) in some fixed range should be 
a minimum for all polynomials of that degree. We can
immediately deduce a property that p(x) must have. Suppose some
polynomial q(x) existed which had the same degree as p(x) and
had the same lead coefficient of x**n and was such that the
error of this approximation, f(x) - q(x), varied from a maximum
of +M to a minimum of -M back to +M, to -M, etc. Suppose that
there are exactly n+1 such maxima. Such polynomials can always
be constructed, as we will see. Now suppose that q(x) is not 
as good an approximation as p(x). Then each of the local max- 
ima are greater deviations than the largest deviation of f(x) 
- p(x). That means that 


     (f(x) - p(x)) - (f(x) - q(x)) = q(x) - p(x)


must oscillate back and forth across the abscissa; this means 
that there are n solutions to an (n-1) degree equation. This 
is impossible and hence we conclude that q(x) had to be at 
least as good in the mini-max sense as p(x). This is quite 
startling in view of the fact that no assumptions at all about 
the magnitude of M were made. Polynomials which oscillate 
about the axis n times over a given interval are derived from 
the oscillatory nature of the sine wave and are known as 
Chebyshev polynomials. We have no time or space to pursue this 
fascinating topic in greater detail but we may recommend Fox 
and Parker [1968] or Hastings [1955] for further reading.


The result of a Chebyshev approximation is a polynomial of the
form

\[ C_0 + C_1 X + C_2 X^2 + \cdots + C_n X^n \qquad (15.8) \]

which is actually computed as:

     C0 + X * (C1 + X * (C2 + ... ))
to minimize operations. 
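
The nested form needs one multiplication and one addition per coefficient. A minimal Python sketch follows; the name horner and the convention that coefficients are listed from C0 upward are our own choices.

```python
def horner(coeffs, x):
    """Evaluate C0 + C1*x + ... + Cn*x**n with coeffs = [C0, C1, ..., Cn]."""
    result = 0.0
    for c in reversed(coeffs):       # start from the innermost parenthesis
        result = result * x + c
    return result
```

For instance, horner([1, 2, 3], 2) computes 1 + 2*(2 + 2*3) = 17.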


It is interesting to note that approximations of this kind can 
be found by an adaptive process in which successive
approximations converge to the desired polynomial. Fox and Parker
[1968, p. 74] describe such a procedure originally due to
Novodvorskii and Pinsker. Hence it would be possible to write
a SNOBOL4 program to produce coefficients automatically for any
given function, range and desired accuracy. 


For a known function and a fixed precision, the Chebyshev in- 
terpolation coefficients can usually be looked up. Hastings 
[1955] is an excellent source. If unavailable, Handbook [NBS] 
should be adequate. For any specific machine, there has 
probably been some work done towards constructing a
mathematical library, and such sources, if they exist, can
often provide routines carefully tailored for a specific
environment. One excellent source for the IBM 360 is IBM [360f].


The functions to follow are machine independent programs for 
computing many of the common transcendental functions. The 
results returned should be as precise as the arguments given, 
with the exception that DREAL precision in some cases may not 
obtain merely because one or more internal constants have less 
than DREAL precision. This difficulty is easily overcome and 
some exercises explore such modifications. 


One problem that arises in writing machine-independent al- 
gorithms is determining the proper accuracy. For example, 
suppose we wish to compute the sum of the series:


     SUM = X + X**2 + X**3 + ...                    (15.9)


where 0 < X < 1/2. Ignore for the moment that the sum of the
series is X/(1-X) and suppose that we wish to calculate the
same result in brute force fashion. How do we know when to
stop adding new terms? We might think of setting up a
PRECISION variable (adjusted for each machine) such that when 
the terms of the series fall below the quantity PRECISION * 
SUM, where SUM is the partial sum so far computed, we quit. 
This method has the disadvantage of being machine-dependent 
and does not give double precision results if X is DREAL. 
Hence we will avoid this method and employ a scheme to let the 
machine tell us when to quit. This will have the happy 
property of adapting to any machine and any precision. Our 
test is, in effect: 


     EQ(SUM, SUM + X ** n)


which means that in order to add X**n to our number we have to
shift it so far to the right that all its '1' bits are lost.
This is implemented by saving the old value of SUM in a
temporary (T) and comparing, updating and branching all in the
same statement at the base of the loop. The following
statements compute the SUM of (15.9) according to this method.


         T = 0
         SUM = 0
         TERM = 1.0
LOOP     TERM = TERM * X
         SUM = SUM + TERM
         T = NE(SUM,T) SUM                              :S(LOOP)


The reader is cautioned that this stopping test is not equiva- 
lent to: 


EQ (TERM, 0) 


If continually multiplied by X, TERM will ultimately become 0
(or raise machine underflow which many SNOBOL4's regard as an
error) but not before it falls below the range of small
numbers (a typical value is 2**-128) whereas to be negligible in
the computation it need merely be below X * 2**-25 or so. Hence,
even if underflow were not raised, the test would be quite 
inefficient. 
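
The same stopping rule can be tried in any language with floating point arithmetic. The Python sketch below sums series (15.9) and also counts the terms used; the function name and the returned count are our additions for illustration.

```python
def geometric_sum(x):
    """Sum x + x**2 + x**3 + ... until adding a term no longer changes the sum."""
    total = 0.0
    term = 1.0
    previous = None                  # plays the role of the temporary T
    n_terms = 0
    while total != previous:         # the machine tells us when to quit
        previous = total
        term *= x
        total += term
        n_terms += 1
    return total, n_terms
```

No precision constant appears anywhere: with X = 0.25 the loop stops after a few dozen terms in double precision, long before TERM itself could reach zero.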


Program 15.6  SQRT

SQRT(Y) will return the square root of the REAL number Y. The
returned precision will equal the precision of Y. The algorithm
used is an excellent example of Newton's Method
for solving implicit equations, which goes as follows. Suppose
we wish to solve the equation:


     f(x) = 0


for x, and suppose further that, given x, we can compute f(x)
and the derivative f'(x). Starting with an estimate, x1, for
x, we can compute f(x1). Since this is supposed to be zero,
we can estimate how far we are off by dividing this number by
the slope f'(x1). We can then modify x1 to obtain a new, and
closer, estimate x2 according to the formula:


     x2 = x1 - f(x1) / f'(x1)


With the new estimate, a new error and slope are calculated 
and the process is repeated until the desired accuracy is ob- 
tained. In many cases, the computation converges rapidly to a 
correct solution. The rate of convergence and the question of 
convergence are decided by algebra for any particular case. 
To determine if the desired accuracy has been reached, we will
wait until

     EQ(xn, xn - f(xn) / f'(xn))

As previously stated, this will adapt to any machine and any 
argument. 


To obtain the initial estimate, x1, we draw a line tangent to
the curve x = y**2 at the point (1,1). This line, y = (x+1)/2,
yields an estimate of the square root which is good for x
close to 1, but quite poor for very large or very small values
of x. While Newton's method will eventually converge on the
correct value, the error is reduced by only a factor of 2 for
large errors; this contrasts with a factor of 2/e for small
errors (see Exercise 15.11). Hence, for efficiency purposes,
the numbers are brought into an acceptable range by (a)
inverting, (b) dividing by 4096, and (c) dividing by 16. Powers of
two are used for range reduction, as opposed to powers of 10,
as these operations can be done exactly on a binary machine.
On the IBM 360/370, the exponent is a power of 16 (for this
reason, it is sometimes regarded as a hexadecimal machine) and
hence, powers of 16 are used where possible.
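
The range reduction and the Newton iteration just described can be sketched in Python as follows. This is a transliteration under our own naming, not a substitute for the program; like the program, it stops as soon as an iteration fails to lower the estimate, which works because the tangent-line estimate is never below the true root.

```python
def sqrt(y):
    """Square root by range reduction plus Newton's method on f(r) = r*r - y."""
    if y < 0:
        raise ValueError("negative argument")
    if y == 0:
        return 0.0
    if y < 0.05:
        return 1.0 / sqrt(1.0 / y)         # (a) invert small arguments
    if y > 4096:
        return sqrt(y / 4096.0) * 64.0     # (b) divide by 4096 = 64**2
    if y > 16:
        return sqrt(y / 16.0) * 4.0        # (c) divide by 16 = 4**2
    root = (y + 1.0) / 2.0                 # tangent-line estimate at (1,1)
    previous = root
    while True:
        err = root * root - y              # f(r)
        slope = 2.0 * root                 # f'(r)
        root = root - err / slope
        if not root < previous:            # no further decrease: converged
            return root
        previous = root
```

For Y = 16 the estimates run 8.5, 5.19, 4.14, 4.002 and so on down to 4.0, at which point the estimate stops decreasing and is returned.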


         DEFINE('SQRT(Y)T,ERR,SLOPE')                   :(SQRT_END)
*
*  Entry point: Range reduction and initialization.
*
SQRT     LT(Y,0)                                        :S(FRETURN)
         EQ(Y,0)                                        :S(RETURN)
         SQRT = LT(Y,0.05) 1. / SQRT(1. / Y)            :S(RETURN)
         SQRT = GT(Y,4096) SQRT(Y / 4096.) * 64.        :S(RETURN)
         SQRT = GT(Y,16) SQRT(Y / 16.) * 4.             :S(RETURN)
         SQRT = (Y + 1.) / 2.
         T = SQRT
*
*  Successively increase the precision of our estimate.
*
SQRT_1   ERR = SQRT * SQRT - Y
         SLOPE = 2. * SQRT
         SQRT = SQRT - (ERR / SLOPE)
         T = LT(SQRT,T) SQRT                            :S(SQRT_1)F(RETURN)
SQRT_END
Epilogue 


The speed of SQRT can be increased (by about 30%) by an al- 
gebraic condensation of the inner loop. This is left as an 
exercise.


Program 15.7  TRIG

By elementary trigonometry, if we can obtain any one of the six
trigonometric functions, viz. sine, cosine, tangent, cotangent,
secant or cosecant, we can obtain them all.


Cotangent, secant and cosecant are merely reciprocals of tan- 
gent, cosine and sine respectively and are therefore not 
represented as functions here. Tangent and cosine are given 
in terms of the sine. 


The algorithm for sine is from Beeler, et al [1972, p. 75] and
relies on the following trigonometric identity:


     sin A = 3 sin(A/3) - 4 (sin(A/3))**3


The identity is normally given as sin 3A and we speak of
'triple-angle' formulas. Collections of such identities are
available in many handbooks such as Handbook [CR] and Handbook
[NBS]. This formula is a recursive formula for obtaining the
sine of an angle in terms of a smaller angle. If the angle
ever becomes small enough we can say it equals itself (the
angle is presumed to be given in radians and we assume the
reader knows that one radian is 57.3° or 180/PI degrees).
Again, the issue of when to terminate arises and this is done
when subtracting off 4*(sin(A/3))**3 does not modify 3*sin(A/3).
But this test must be made before sin(A/3) is called or else
we will have an infinite recursive plunge. Hence we do the
test on A/3. If equality obtains for A/3 it must also obtain
for the slightly smaller value sin(A/3). Thus the algorithm
terminates when 4*(A/3)**3 is insignificant compared with
3*(A/3), or, equivalently, when 4*A**2 is insignificant compared
with 27. With 25 bits of precision, for example, this happens
if A is 2**-12 or so. Since A decreases by thirds, we will
require eight recursive calls or so before the function is
evaluated. This will depend somewhat on the original argument. 
By using other identities, the amount of recursion required 
can be considerably reduced. See Exercise 15.12. 
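
A minimal Python sketch of the recursion follows, with the termination test placed before the recursive call as discussed. The names sine and _sine_reduced are ours; the value of PI is the 15-digit constant used in the program.

```python
PI = 3.14159265358979

def _sine_reduced(a):
    """Triple-angle recursion; assumes 0 <= a < 2*PI."""
    if 27.0 == 27.0 - 4.0 * a * a:     # 4*a**2 negligible beside 27: sin(a) = a
        return a
    s = _sine_reduced(a / 3.0)
    return s * (3.0 - 4.0 * s * s)     # 3*s - 4*s**3

def sine(a):
    """Sine of a (radians): fold negatives, reduce to [0, 2*PI), then recurse."""
    if a < 0:
        return -sine(-a)
    if a < 2 * PI:
        return _sine_reduced(a)
    k = int(a / (2 * PI))
    return _sine_reduced(a - k * 2 * PI)
```

In double precision the recursion bottoms out after somewhat under twenty levels for arguments near 2*PI, more than the eight quoted above for 25 bits, as one would expect from the finer precision.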


         DEFINE('SIN(A)K')
         DEFINE('SIN.(A)')
         PI. = 3.14159265358979                         :(SIN_END)
*
*  Entry point: reduce range to [0, 2 PI.)
*
SIN      SIN = LT(A,0) -SIN(-A)                         :S(RETURN)
         SIN = LT(A,2 * PI.) SIN.(A)                    :S(RETURN)
         K = CONVERT(A / (2 * PI.),'INTEGER')
         SIN = SIN.(A - K * 2 * PI.)                    :(RETURN)
*
*  Test and return or plunge recursively and adjust.
*
SIN.     SIN. = EQ(27.,27. - 4 * A * A) A               :S(RETURN)
         A = SIN.(A / 3.)
         SIN. = A * (3 - 4 * A * A)                     :(RETURN)
SIN_END
*
*  Standard identities yield other trigonometric functions.
*
         DEXP('COS(A) = SQRT(1 - SIN(A) ** 2)')
         DEXP('TAN(A) = SIN(A) / COS(A)')


Names referenced    Name    Type        Where defined
by TRIG:            SQRT    Function    Program 15.6
                    DEXP    Function    Program 14.1

Epilogue


The reason for the separate recursive routine (SIN.) is to
save time (no need for range checking after it is done
originally) and space on the recursive stack (no need to
continually push K).


Program 15.8  ARC

The functions ASIN(X), ACOS(X) and ATAN(X) will return
respectively the arc sine, arc cosine and arc tangent in
radians. As was the case with the trig functions, a
nonobvious computation is required for one of the functions, and
standard trig identities produce the other two. Since we
already have sine and cosine we could use Newton's method to
compute the arcs. Alternatively, we could invert the recursive 
procedure used to compute the sine. For variety, however, we 
will leave these options as exercises and consider yet another 
method for producing a machine-independent computation of the 
arcs. 


A power series expansion for arc sine X is [Handbook, NBS, p.
81]:

\[ X + \frac{X^3}{2 \cdot 3} + \frac{1 \cdot 3\, X^5}{2 \cdot 4 \cdot 5} + \frac{1 \cdot 3 \cdot 5\, X^7}{2 \cdot 4 \cdot 6 \cdot 7} + \cdots \qquad (15.10) \]

While this series converges for all |X| < 1, convergence is
slow if X is near one. For X < 0.5, however, the convergence
rate is quite acceptable requiring at most about P/2 terms
where P is the precision in bits.


A power series expansion for arc cos(1-Z) [Handbook, NBS, p.
81] is

\[ \sqrt{2Z}\,\left[ 1 + \frac{Z}{4\,(3)} + \frac{(1)(3)\,Z^2}{4^2\,(5)\,(2!)} + \frac{(1)(3)(5)\,Z^3}{4^3\,(7)\,(3!)} + \cdots \right] \]

This series converges more rapidly in the worst case than the
previous one. It makes use of the fact that the parabolic
curve closely approximates
the sine curve. The power series expansion is actually for
the deviation between the two. After range reduction, the
worst case value is Z = 1 and convergence may be expected in
about P - log2(P) steps. Hence, we will define the arcs in terms
of the power series for arc cosine. 


The two methods actually complement each other and together 
can provide a method of keeping the number of iterations below 
P/2. This is left as Exercise 15.16. 
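
The arc cosine series and the two identities built on it can be sketched in Python as below. The function names are ours, and the library routine math.sqrt stands in for the SQRT of Program 15.6; the loop uses the same machine-driven stopping test as before.

```python
import math

PI = 3.14159265358979

def acos(x):
    """Arc cosine from the series for arccos(1 - z), with z = 1 - x."""
    if x < 0:
        return PI - acos(-x)               # reduce to the first quadrant
    z = 1.0 - x
    total = 1.0                            # leading 1 of the bracketed series
    term = 1.0
    k = 1
    previous = None
    while total != previous:
        previous = total
        term = term * (2 * k - 1) * z / (4 * k)
        total += term / (2 * k + 1)
        k += 1
    return math.sqrt(2.0 * z) * total

def asin(x):
    return PI / 2 - acos(x)

def atan(x):
    a = acos(1.0 / math.sqrt(1.0 + x * x))
    return -a if x < 0 else a
```

The first pass through the loop adds z/12 and the second 3*z**2/160, matching the second and third terms of the series above.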


         DEFINE('ACOS(X)K,TERM,T')
         PI. = 3.14159265358979                         :(ACOS_END)
*
*  Entry point: Reduce the range to consider only quantities
*  in the first quadrant.
*
ACOS     ACOS = LT(X,0) PI. - ACOS(-X)                  :S(RETURN)
*
*  Initialize for the loop starting with label ACOS_1. This
*  is a power series for arc cosine.
*
         ACOS = 1.
         TERM = 1.
         X = 1. - X
         K = 1
ACOS_1   TERM = TERM * (2 * K - 1) * X / (4 * K)
         ACOS = ACOS + TERM / (2 * K + 1)
         K = K + 1
         T = NE(ACOS,T) ACOS                            :S(ACOS_1)
         ACOS = SQRT(2 * X) * ACOS                      :(RETURN)
ACOS_END


*
*  Arc sine and arc tangent are defined in terms of arc
*  cosine.
*
         DEXP('ASIN(X) = (PI. / 2) - ACOS(X)')
         DEXP('ATAN(X) = ACOS(1. / SQRT(1 + X * X)) ;'
+             ' ATAN = LT(X,0) -ATAN ;')

Names referenced    Name    Type        Where defined
by ARC:             SQRT    Function    Program 15.6
                    DEXP    Function    Program 14.1

Program 15.9  LOG

LOG(X,B) will return the log of X to the base B. If B is null
(or absent), the natural log is returned. Given a method of
obtaining logs to some base B, one can obtain
a log to an arbitrary base B1 by the identity:


LOG (X,B1) = LOG(X,B) / LOG(B1,B) 


and so the problem reduces to finding logs to some base B.


If one were coding in assembly language, a natural choice on a
binary machine would be base 2. This is because the exponent 
part of the real number is the integer part (actually the 
floor plus one) of the logarithm and is available with no com- 
putation. Moreover, the fractional part of the logarithm can 
also be plucked out of the exponent after successive squarings 
of the mantissa in a method described by Gosper in Beeler
[1972, p. 76].


Unfortunately, SNOBOL4 cannot generally 'get at' the exponent
of a floating point number (except for SITBOL). An integer
approximation to the base 10 logarithm can be found by
counting the number of characters in a string representation of
the number. Thus SIZE(CONVERT(X,'INTEGER')) returns the
ceiling of LOG10 X. If X is larger than the largest integer,
however, it must be divided down. One can translate Gosper's
method to operate on a decimal machine (which is what we have
at this point) by raising the remainder to the 10th power for
each succeeding digit. This is the method actually used.


*  LOG(X,B) will return the logarithm of X to the base B.
*  LOG(X) will return the natural logarithm of X.
*
         LN_10 = 2.3025850929940456840
         DEXP('LOG(X,B) = NE(B,0) CLOG(X) / CLOG(B) ;'
+             ' LOG = EQ(B,0) CLOG(X) * LN_10 ;')
*
*  CLOG will return the common log (base 10) of X.
*
         DEFINE('CLOG(X)FACTOR,T,K')                    :(CLOG_END)


{ Entry point: FACTOR is initialized to 1.0 with a precision | 
{ equal to the precision of the argument X. Here we handle | 
| fractional cases (negative logs) in the event that either | 
{| the original number was below 1.0 or the number xX _ goes | 
| fractional as a result of the division at CLOG_4. { 

a | 


| SS Se ea eR er SE ee a ee 
CLOG X = X * 1.0 
FACTOR = X/X 
CLOG_1 X = L@(X,1) 17 X :F (CLOG_2) 
FACTOR = -FACTOR 


Ge ae ee ng Op eG ee TE Oe eS eL e R e ee eee  ee S  e ee a eo ee 
{ Here's the main Loop. We determine the number of digits 
{ (minus one) to the left of the decimal (K), which we may 
{ regard as a crude approximation cf the log. Reduce the 
{ log of X by this much by dividing by 10 ** K. Then find 
{| the log of this reduced quantity. : 


eevee kas 


ne seers tieemnneeaenenienenishcininwe: 

CLOG_2 EQ(X,1.0) : S (RETURN) 
K = SIZE (CONVERT(X, *INTEGER')) - 1 3 F(CLOG_4) 
EQ (K, 0) ; 2S (CLOG_ 3) 
CLOG = CLOG + K * FACTOR 
T = NE(CLCG,T) CLOG : F (RETURN) 
X = Xf 10. ** K 
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CLOG_3 FACTOR = FACTOR / 10. | 
X = X ** 10 : (CLOG_1) 


ee erent 
| If X is larger than the largest integer, we come here. | 
a ee ee 


CLOG_4 K =. 10 

Xx = X/ 10. ** K 

CLOG = CLOG + K * FACTOR : (CLOG_2) 
CLOG_END 
Names referenced     Name       Type        Where defined
by LOG:              DEXP       Function    Program 14.1

Epilogue


Since the characteristic of a number to the base 10 can be
obtained by inspection, the method above is suitable for
computing logarithms on a four-function desk calculator. The
reader is invited to try a few examples for himself.


Another method for computing logarithms is the power series:

$$\ln(1+x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \cdots \qquad (15.11)$$

To use this power series one must reduce a large x until it
comes close to 0. This can be done in part by the SIZE method.
To bring x yet closer to 0, the identity:


         LOG(X) = 2 * LOG(SQRT(X))


can be used.
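
A rough Python sketch of the power-series approach follows. It is an assumption-laden illustration: the name `ln_series`, the 0.1 threshold and the tolerance are invented for the example, and the range reduction uses only the square-root identity (the SIZE method is left out for brevity).

```python
import math

def ln_series(x, tol=1e-17):
    """Natural log via series (15.11) after square-root range reduction."""
    if x <= 0:
        raise ValueError("x must be positive")
    scale = 1.0
    while abs(x - 1.0) > 0.1:         # LOG(X) = 2 * LOG(SQRT(X))
        x = math.sqrt(x)
        scale *= 2.0
    u = x - 1.0                       # now ln(x) = ln(1 + u) with |u| <= 0.1
    total = 0.0
    power = u                         # holds (-1)**(n-1) * u**n
    n = 1
    while abs(power) > tol:
        total += power / n
        power *= -u
        n += 1
    return scale * total
```

Because |u| is at most 0.1 after reduction, each additional term contributes roughly one more decimal digit.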


ce ee 

{{ Program {ff RAISE(X,Y) will raise X to the power Y. This 
tt 15.10 t{ function is entirely redundant if the second 
14 | operand of the ** operator is permitted to 
L-—________—__4 be REAL. It is not in many versions of the 
language and so RAISE must be included in our set. Indeed, 
its presence may suggest alternative methods for computing 
some Of our functions (certainly SORT). r 


If one can raise some number, Z, to an arbitrary power, one 
can then define RAISE(X,Y) as: 


RAISE(Z, LOG(X,Z) * Y) 
The number we will choose as Z is the base of the natural logs 
(normally designated e) and a special function EXP(X) will 
return e raised to the Xth power; EXP is normally called the 
exponential function. ef 
EXP (X) can be written as a Taylor series: 


1+ X + X2/2! + X3/73! + 2... 


A ee te te a OS RE ES SS ES OT A LN TN ARE SEE SD Se a NE AD NY RE EIS ORS REY. a 


which converges rapidly for X <¢ 1. For X > 1, we simply obtain 
the integer part (the floor) I and use the rule: 


x | X-1 I 
e = e * e 


         DEXP('RAISE(X,Y) = EXP(Y * LOG(X))')
*
         DEFINE('EXP(X)TERM,K,T')
         NAT_BASE  =  2.718281828459045            :(EXP_END)
*
*  Entry point for EXP. Reduce the range to [0,1].
*
EXP      EXP  =  LT(X,0) 1. / EXP(-X)              :S(RETURN)
         K  =  GT(X,1) CONVERT(X,'INTEGER')        :F(EXP_1)
         EXP  =  EXP(X - K) * NAT_BASE ** K        :(RETURN)
*
*  Initialize for the power series which is summed in the
*  loop headed by EXP_2.
*
EXP_1    TERM  =  1.
EXP_2    EXP  =  EXP + TERM
         K  =  K + 1.
         TERM  =  TERM * X / K
         T  =  NE(T,EXP) EXP                       :S(EXP_2)F(RETURN)
EXP_END

Names referenced     Name       Type        Where defined
by RAISE:            LOG        Function    Program 15.9
                     DEXP       Function    Program 14.1
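
For readers who want to experiment outside SNOBOL4, the same strategy can be sketched in Python. The names `exp_series` and `raise_power` are invented for this example, and `math.log` is used as a stand-in for the LOG of Program 15.9.

```python
import math

NAT_BASE = 2.718281828459045

def exp_series(x):
    """e**x by range reduction plus the Taylor series (illustrative sketch)."""
    if x < 0:
        return 1.0 / exp_series(-x)
    if x > 1:
        k = int(x)                    # integer part (the floor, since x > 0)
        return exp_series(x - k) * NAT_BASE ** k
    total, term, n = 0.0, 1.0, 0
    while True:
        new_total = total + term
        if new_total == total:        # the term no longer changes the sum
            return total
        total = new_total
        n += 1
        term = term * x / n           # next term: x**n / n!

def raise_power(x, y):
    """x**y computed as EXP(Y * LOG(X)); math.log stands in for LOG."""
    return exp_series(y * math.log(x))
```

The stopping rule mirrors the listing: summation ends when adding a further term leaves the floating-point sum unchanged.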

EXERCISES


Exercise 15.1.  Rewrite COMB (Prog. 15.1) so that it computes
iteratively. Do not separately compute numerator and
denominator as this may result in an unnecessary overflow.
Also do not divide numbers that are not divisible.


Exercise 15.2.  A rather unusual method for computing some
combinatorial functions was shown to the author by Dennis
Allen. It uses pattern matching to count combinations. The
pattern matcher will undergo a number of attempts to match and
this can be used (in fullscan mode) to compute (however
inefficiently) some combinatorial functions. For example, let
INC(.N) increment the variable N by 1. Then,


         &FULLSCAN  =  1
         N  =  0
         S  LEN(1)  *INC(.N)  FAIL


will count the number of characters in the string S. Rewrite
COMB(N,M) so that it computes the function this way.


Exercise 15.3.  What is the maximum number representable in
the combinatorial number system with nome N where
SIZE(COMB_ALPHA) = L?


Exercise 15.4.  Write a function COMBDE(K,N) which converts
integer K into a representation in the combinatorial number
system with nome N. If there are insufficient characters in
COMB_ALPHA, COMBDE should fail.


Exercise 15.5.  Since SPITBOL does not allow redefinition of
operators, the INFINIP package (Prog. 15.3) must be modified
to run under that processor. (a) What, for example, would DIFF
look like under such a modification? (b) How many statements
in DIV would require modification?


Exercise 15.6.  Augment the INFINIP package by adding the **
operator. Do not multiply out the indicated number of times
but use the rule:

$$X^{N} = X^{(N/2) \cdot 2} \cdot X^{\mathrm{REMDR}(N,2)} \qquad (15.12)$$


Exercise 15.7.  In the DIV procedure of the INFINIP package, a
better estimate of the trial quotient can be obtained by
making the first digit of Y higher (better to be 9 than 1).
This can be done by multiplying both X and Y by the same
quantity. See Knuth [Vol. 2, p. 235]. Implement a scheme to
make sure that the first digit of Y is at least 5 (requires
only one additional statement if SUBSTR (Prog. 3.9) is used).


Exercise 15.8.  Write a function ROUND(X) which will return
the nearest integer to X (on ties, pick either). This requires
three statements.


Exercise 15.9.  Let X, Y and Z be positive real numbers. For
what values of X will


         FLOOR(Y / X) < Z ?


Using the relationships in (15.7) and the fact that

$$N > M \iff N \ge M + 1$$

for all integer N and M, give a step-by-step proof of your
answer.


Exercise 15.10.  To improve the speed of SQRT (Prog. 15.6),
replace the three statements at label SQRT_1 by one.


Exercise 15.11.  Let e represent the error of an approximation
R to the square root of the quantity $x^2$. That is,

$$e = R - x$$

One iteration of Newton's method produces a new error. (a)
Derive a formula which yields the new error E in terms of the
old error e. (b) Assuming an initial error of 0.1, how many
iterations will produce an error less than $10^{-29}$?


Exercise 15.12.  Given the formula for sine 3A, deduce a
formula for sine 9A. Recode the SIN routine of TRIG (Prog.
15.7) accordingly. Can the same stopping criterion be used?


Exercise 15.13.  If the second statement of SIN.() had been:


         A = SIN.(A / 3)


a bug would have been introduced. For which values of argument
A would SIN(A) then yield an incorrect value?


Exercise 15.14.  Compute ASIN(X) using SIN(A), COS(A) and
Newton's method in a manner similar to SQRT. Use X as the
original estimate of ASIN(X).


Exercise 15.15.  To express arc sine recursively, one may use
a half-angle (or fractional angle) formula in order to reduce
the range. One such is:

$$\sin(A/2) = \sqrt{\frac{1 - \sqrt{1 - \sin^2 A}}{2}}$$

(a) Express ASIN(X) in terms of ASIN(X / 2). (b) If one were
to use the recursive formula to implement ASIN(X), what
stopping criterion would one use?


Exercise 15.16.  Using the power series of (15.10), modify
ACOS as suggested in the text.


Exercise 15.17.  In LOG (Prog. 15.9) we depend on being able
to convert REALs to INTEGERs for all reals in the range
(0, M). That is, we suppose that the maximum integer is
greater than M. What is M? (Hint: the answer is not
$10^{10}$.)


Exercise 15.18.  It is not strictly necessary to insert
numeric constants into the programs TRIG, ARC, LOG and RAISE.
Rather, they may be computed by appropriate calls on the
defined routines. Modify the routines so that they compute the
constants.


Exercise 15.19.  Assume you are writing an assembler and must
construct a real number in its machine form for a binary
machine with 27 bits of precision. Given other functions in
the book (Chapter two), this reduces to the following problem:
given a non-zero real number X, find the exponent N and
integer I such that $2^{26} \le I < 2^{27}$ and

$$X \approx 2^{N} \cdot \frac{I}{2^{27}}$$

Using LOG (Prog. 15.9), N and I can be computed in three
statements. What are they?


Exercise 15.20.  In order to make the random number generator
(RANDOM, Prog. 16.1) go backward, we need to be able to find
the inverse of a multiplier. That is, we need to solve for X
in:

$$X \cdot R \equiv 1 \pmod{M}$$

This can be done by noting that:

$$X \equiv R^{M-2} \pmod{M}$$

Assuming that M-2 multiplications may be too time-consuming,
work out a method whereby only $2 \log_2(M-2)$ multiplications
are required.


Exercise 15.21.  If RAISE (Prog. 15.10) is used in SPITBOL and
if a DREAL argument is given to the function EXP, the returned
value will be DREAL but will not have DREAL accuracy. Why? How
can one correct this deficiency and still return a
single-precision result if a REAL is given as argument? (Hint:
the answer requires modifying one statement.)


Stochastic or random strings have many applications within the
computing sphere of activity. Some exotic uses include poetry,
choreography, play and brand-name generation, cryptographic
and linguistic analysis, and even police-patrol scheduling
[Aberg 1974]. Simulations and game-playing also make critical
use of the computer's ability to generate near random
sequences. More mundane applications include algorithm testing
and timing.


Digital computers have the power to produce prodigious quan- 
tities of what appear to be random strings and/or random 
numbers. However, if pressed to define precisely what is meant 
by the term 'random' one must be careful. For example, Table 
16.1 contains two groups of 'random' English words. One group 
was formed by selecting words at random from a novel. The 
other group was formed by selecting dictionary entries at ran- 
dom. It should be immediately evident which source produced 
which group. Yet both groups have at least some claim to being 
called 'random English words'.


Table 16.1  One of the groups of words shown below was
obtained by randomly selecting from entries in a dictionary
and the other by selecting words from a novel. Is it obvious
which is which?

         Source A        Source B
         --------        --------------
         your            dialectician
         a               Jemappes
         the             profligate
         and             disenfranchise
         Hell            opaque


To make the notion of randomness more precise we speak of a 
sample space containing a possibly infinite collection of 
things. A random selection is a selection of a single item 
from the sample space with the proviso that all items have an 
equal chance for selection. In the example above, one sample 
space was the set of dictionary entries which approximates the 
set of distinct words of the English language. The other sam- 
ple space was the set of words in a novel which approximates 
the totality of all words actually used to communicate thought 
using the English language. Note that a sample space may have 
repeated items such as the novel or they may all be distinct 
as in the dictionary case. Note too that a sample space may
be completely unstructured as in the two examples given. This
may be contrasted with a sample space obtained by five tosses
of a coin in which the sample space is a well-structured set
containing 32 combinations, each describable by a sequence of
five binary digits.


Program 16.1  RANDOM

Random strings are constructed from random numbers and so this
is what we must obtain first. RANDOM(N), where N is a positive
integer, will return a 'random' number from the sample space
{1, 2, ..., N}. For example, if RANDOM(3) were called 10 times
the sequence produced could be:


BB. Be 2S OF! 2S OT 8 


If the argument N is 0, the number returned will be of type 
REAL chosen from the sample space [0,1) which is the interval 
on the real line from 0 (inclusive) to 1 (exclusive).* Calls
to RANDOM with different arguments may be intermixed without 
adversely affecting the generating process. 


Since the numbers are produced by a deterministic process they 
are not truly random but only apparently random. It is con- 
ventional to term such processes pseudo-random. Pseudo-random 
sequences have the very convenient property of being 
repeatable. This can be important in debugging or in studying 
certain effects in greater detail. If one wishes to obtain a 
different sequence one can set the variable RAN_VAR to some
other value in the range {1, 2, ..., 414970}. For game
playing, it is sometimes necessary to initialize the random 
number generator to a value which is indeed unpredictable. For 
such purposes one can use the clock. 


*  RANDOM(N) will return an integer uniformly distributed on
*  1,2,...,N.  If N=0, it will return a real uniformly
*  distributed in the interval [0,1).
*
         DEFINE('RANDOM(N)')
         RAN_VAR  =  1                             :(RANDOM_END)
*
*  The REAL is produced in any case. If an integer is wanted,
*  the REAL is multiplied by the proper range. Note that
*  CONVERT truncates rather than rounds.
*
RANDOM   RAN_VAR  =  REMDR(RAN_VAR * 4676, 414971)
         RANDOM  =  RAN_VAR / 414971.
         RANDOM  =  NE(N,0) CONVERT(RANDOM * N,'INTEGER') + 1
+                                                  :(RETURN)
RANDOM_END
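
The generator is small enough to mirror in Python for experimentation. In the sketch below the class name `LehmerRandom` is an invention of this example; the multiplier 4676, the modulus 414971 and the seed of 1 are those of the listing.

```python
MODULUS = 414971
MULTIPLIER = 4676

class LehmerRandom:
    """Illustrative Python counterpart of RANDOM(N); not the book's code."""

    def __init__(self, seed=1):
        self.ran_var = seed

    def random(self, n=0):
        self.ran_var = (self.ran_var * MULTIPLIER) % MODULUS
        real = self.ran_var / MODULUS          # real in [0, 1)
        if n == 0:
            return real
        return int(real * n) + 1               # truncation, as with CONVERT

gen = LehmerRandom()
first_three = [gen.random(3) for _ in range(3)]
```

Re-creating the object with the same seed reproduces the same sequence, which is the repeatability property discussed in the text.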


*Actually, this is a slight fiction. The number of reals
representable by the machine is finite, whereas the number of
reals in the interval is (uncountably) infinite. The intent
is to approximate this interval.


Epilogue


RANDOM(N) belongs to a class of generators called the
congruential type first proposed by Lehmer [1951]. Given some
integer R in the range 0 < R < M where M is some integer
called the modulus, the next value of R (which we denote by
R') is obtained by the computation

$$R' = R \cdot A \pmod{M}$$

or, in SNOBOL4 notation

         R' = REMDR(R * A, M)


where A is some positive integer called the multiplier. The
numbers will begin to repeat themselves after a certain period
governed by R, A and M. For example, if M=10, A=7 and R=3
(thoroughly impractical values) the sequence of R's becomes


         3  1  7  9  3  1  7  9  3  1  7  9  ...


repeating themselves every four numbers (the period is said to
be four). A random real number in the interval [0,1) is then
obtained by dividing R by M.
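
The toy example is easy to check mechanically. The helper names below are assumptions of this sketch; `period` gives up and returns None if the starting value is never revisited within M steps.

```python
def congruential_sequence(r, a, m, count):
    """Return the first `count` values R, R', R'', ... of R' = R*A (Mod M)."""
    values = []
    for _ in range(count):
        values.append(r)
        r = (r * a) % m
    return values

def period(r, a, m):
    """Number of steps until the starting value recurs, or None."""
    start = r
    for steps in range(1, m + 1):
        r = (r * a) % m
        if r == start:
            return steps
    return None
```

With M=10, A=7 and R=3, `congruential_sequence(3, 7, 10, 8)` gives `[3, 1, 7, 9, 3, 1, 7, 9]`.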


The congruential method is extremely important historically
because the operation


         R' = REMDR(R * A, M)


can be accomplished with one multiply instruction where M is
the natural modulus of the machine (for example, on the IBM
360 the natural modulus is $2^{31}$). Use of the natural
modulus is attractive from an efficiency standpoint but is
machine dependent and can't be used in SNOBOL4 anyway because
the computation will be regarded as an error (arithmetic
overflow).


The sequence of R's will consist only of integers relatively
prime to M. This means that a period equal to M where M is a
natural modulus is impossible. A way around this is to use
the so-called mixed congruential generator first proposed by
Greenberger [1961] in which the formula

$$R' = R \cdot A + C \pmod{M}$$

is used. For correctly chosen values of A and C, the R's will
range through every number in the set {0, 1, ..., M-1}.
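
A small experiment shows the difference the additive constant makes. The values A=5, C=3 and M=16 below are illustrative choices for this sketch and do not come from the text.

```python
def cycle_from(r, a, c, m):
    """Values visited by R' = R*A + C (Mod M), starting at r, until one repeats."""
    seen = []
    while r not in seen:
        seen.append(r)
        r = (r * a + c) % m
    return seen

mixed = cycle_from(0, 5, 3, 16)    # mixed generator: visits all 16 residues
pure = cycle_from(1, 5, 0, 16)     # same multiplier with C = 0: only 4 values
```

Here the mixed form reaches every residue 0 through 15, while the purely multiplicative form with the same multiplier cycles through just 1, 5, 9 and 13.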


Another method of obtaining long periods is to use a prime
modulus. If M is prime, then for certain values of A the
generator:

$$R' = R \cdot A \pmod{M}$$

will cause the R's to cycle through every integer in the range
{1, 2, ..., M-1}. Such an A is called a primitive element of
the field of integers modulo M (see for example, Barnard and
Child [1955], p. 438).
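
Whether a given multiplier is a primitive element can be tested by brute force for moduli of this size. The function names are assumptions of this sketch, and the method (counting steps until the power returns to 1) is the naive one rather than anything prescribed by the text.

```python
def multiplicative_order(a, m):
    """Smallest k > 0 with a**k = 1 (Mod m); raises if no such k exists."""
    r, steps = a % m, 1
    while r != 1:
        r = (r * a) % m
        steps += 1
        if steps > m:
            raise ValueError("a is not invertible modulo m")
    return steps

def is_primitive(a, m):
    """True if a generates all of 1..m-1 modulo the prime m."""
    return multiplicative_order(a, m) == m - 1
```

For instance, 3 is a primitive element modulo 7 (its powers run through 3, 2, 6, 4, 5, 1) whereas 2 is not (2, 4, 1). The same call can be applied to the pair used by RANDOM, `is_primitive(4676, 414971)`, though that takes a few hundred thousand steps.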


The prime-primitive pair must be such that A*R never
overflows the machine. If the maximum integer is, for example,
$2^{31}-1$ (as it is for most 32-bit machines), then it will be
sufficient that A*M < $2^{31}$. A list of prime-primitive pairs
is given in Table 16.2 together with an indication of the
number of bits of arithmetic required to avoid overflow. The
choice of prime-primitive pair for the function RANDOM was
based on the observation that most SNOBOL4's can represent all
positive integers below $2^{31}$.


Table 16.2

         Prime        Primitive     Smallest power
         modulus M    element P     of 2 > M*P
         ---------    ---------     --------------
               127           12     $2^{11}$
               127           29     $2^{12}$
               211           35     $2^{13}$
               211           44     $2^{14}$
               491           59     $2^{15}$
               491           84     $2^{16}$
              1103          117     $2^{17}$
              1103          156     $2^{18}$
              1223          421     $2^{19}$
              1987          451     $2^{20}$
              1987         1017     $2^{21}$
              2741         1148     $2^{22}$
             10657          735     $2^{23}$
             10657          824     $2^{24}$
              4409         4035     $2^{25}$
             19423         3088     $2^{26}$
             10657         7367     $2^{27}$
             24281         9713     $2^{28}$
             29443        13300     $2^{29}$
             39971        20411     $2^{30}$
            414971         4676     $2^{31}$
            532333         8705     $2^{33}$
           1299709        16322     $2^{35}$
           1798963       160658     $2^{39}$
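
The third column of the table follows directly from the first two, and can be recomputed with a one-line helper. The function name is an assumption of this sketch.

```python
def bits_required(m, p):
    """Smallest k such that 2**k > m * p."""
    return (m * p).bit_length()

pair_used_by_random = bits_required(414971, 4676)   # 414971 * 4676 = 1940404396
```

For the pair used by RANDOM the product is 1,940,404,396, which is below $2^{31}$ = 2,147,483,648, so 31 bits suffice.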


Tests for Randomness

One might suppose that there existed a single, simple test for
randomness which could be applied to some pseudo-random
generator to determine a coefficient of randomness.
Unfortunately, no such single test exists. It is interesting
to note that if one had a test to determine whether a sequence
was truly random that test could be used to produce, by
elimination, a truly random sequence. We would then have a
contradiction in terms, since an algorithmic process can never
produce truly random numbers. Rather than a single,
all-powerful test for randomness, there exist many tests each
oriented toward detecting violations of important
characteristics of random behavior. Knuth [Vol. 2] and Canavos
[1967] describe a number of such tests. Those outlined here
are from Canavos and have actually been applied to the
generators mentioned in this chapter.


The most common test seems to be the bins test, which seeks to
answer the most obvious question: Is each of the B integers
from RANDOM(B) equally likely? RANDOM(B) is called
successively N times where B is the number of bins. The number
of numbers appearing in each bin should average out to N/B.
But the distribution over the bins cannot be expected to be
perfectly flat or one would suspect nonrandom behavior. One
can measure the extent to which the distribution deviates from
perfection and the deviation proper for a random generator is
given by the so-called Chi-squared distribution. The number of
bins, B, is selected so as to maximize the power of the test
and depends upon the number of samples taken. For example, for
N = 1000, the number of bins suggested is 50.
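
A bins test can be sketched in a few lines of Python. The helper names are assumptions of this example, the generator constants are those of RANDOM, and the statistic is the usual sum of squared deviations divided by the expected count.

```python
def lehmer_samples(n, bins, seed=1):
    """n integers in 1..bins from R' = R * 4676 (Mod 414971)."""
    r, out = seed, []
    for _ in range(n):
        r = (r * 4676) % 414971
        out.append(int(r / 414971 * bins) + 1)
    return out

def bin_counts(samples, bins):
    """How many samples fell into each of the bins 1..bins."""
    counts = [0] * bins
    for s in samples:
        counts[s - 1] += 1
    return counts

def chi_squared(counts):
    """Sum of (observed - expected)**2 / expected for equally likely bins."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

counts = bin_counts(lehmer_samples(1000, 50), 50)
statistic = chi_squared(counts)
```

The resulting statistic would then be compared with the chi-squared distribution for B-1 = 49 degrees of freedom; values that are either very large or very close to zero are grounds for suspicion.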


Another popular test for randomness is the correlation test 
which determines whether numbers a given fixed distance apart 
are correlated. For example, in the Canavos series, correla- 
tion is tested for distances of 1 through 8. The extent to 
which the numbers are correlated in any given sequence can be 
calculated. Random generators would tend to produce zero
correlation in the long run, but in the short run they are
expected to produce a small correlation. Observed correlations
above or below this level are suspicious.
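
One way to compute such a correlation is the ordinary (Pearson) coefficient between the sequence and a copy of itself shifted by the chosen distance. This is a sketch under that assumption; the text does not say which formula Canavos uses, and the function name is invented here.

```python
import math

def lag_correlation(values, distance):
    """Pearson correlation between values[i] and values[i + distance]."""
    if distance < 1 or distance >= len(values):
        raise ValueError("distance must be between 1 and len(values) - 1")
    xs = values[:-distance]
    ys = values[distance:]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

A steadily increasing sequence gives a coefficient of +1 at distance 1, and a strictly alternating sequence gives -1; a good generator should stay near zero at every distance tested.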


When RANDOM(2) is called repeatedly, the binary sequence
produced can be considered to be like the head-tail sequence
produced by flipping a coin. Questions one might ask are: Is
heads just as likely as tails? This is answered by the bins
test. Another question is: Will heads follow heads as often
as it follows tails? This is answered by the correlation test.
A classic coin-tossing question not answered by these tests is
the following: If K heads in a row are produced, is the next
toss more likely to be a head or a tail? One might fear that
an artificial system of producing random numbers might be too
'round' and not produce enough long sequences or be too
'angular' and produce too many. Such questions are settled by
the so-called runs test. A run is a sequence of heads bounded
on both sides by a tail or a sequence of tails bounded by
heads. The number of runs of length 1, 2, 3, ... is measured
and the resulting distribution should be close to that
obtained from a random distribution. As in the bins test, the
chi-square formula is used to determine if the distribution is
'too good' or 'too bad'.
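
Counting runs is straightforward with the standard library. In this sketch (the function name is an assumption) the first and last runs are counted as well, even though they are bounded on one side only; a stricter tally could drop them.

```python
from collections import Counter
from itertools import groupby

def run_length_counts(tosses):
    """Map each run length to the number of runs of that length."""
    return Counter(len(list(group)) for _, group in groupby(tosses))
```

For the tosses `"HHTHTTT"` the runs are HH, T, H and TTT, so the tally is two runs of length 1, one of length 2 and one of length 3.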


It is frequently useful to know of other generators so that if
the results of one generator or type of generator become
suspect, another may be plugged in. The following extremely
portable generator was suggested by Kruskal [1969].

$$R' = R \cdot 125 \pmod{2^{13}}$$

The one multiplication by 125 can be replaced by three
multiplications by 5 so that provided the machine can contain
$5 \cdot 2^{13}$ as an integer, the computation can be done
without overflow. Unfortunately the period is short.
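
The three-step multiplication and the length of the period can both be checked directly. The function names are assumptions of this sketch.

```python
KRUSKAL_MODULUS = 2 ** 13

def kruskal_next(r):
    """One step of R' = R * 125 (Mod 2**13), done as three multiplications by 5."""
    for _ in range(3):
        r = (r * 5) % KRUSKAL_MODULUS   # intermediate product stays below 5 * 2**13
    return r

def kruskal_period(seed=1):
    """Number of steps before the seed value recurs."""
    r, steps = kruskal_next(seed), 1
    while r != seed:
        r = kruskal_next(r)
        steps += 1
    return steps
```

Starting from 1 the sequence repeats after 2048 values, a quarter of the modulus, which bears out the remark that the period is short.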


Another method is to construct a random number generator
according to a recipe suggested by Knuth [Vol. 2, p. 155-156].
One such generator is:

$$R' = R \cdot 3141 + 110795 \pmod{2^{19}}, \qquad 2^{19} = 524288$$

Another approach is to use a standard generator with multiple
precision arithmetic. One generator endorsed by Coveyou and
Macpherson [1967] (they do not endorse many) is:

$$R' = R \cdot 25214903917 \pmod{2^{35}}, \qquad 2^{35} = 34359738368$$

To perform the arithmetic within SNOBOL4 on the IBM 360, three
integers are needed to contain the multiplication. This will
slow the computation and increase the complexity of the
program but the random numbers should be quite random.


Program 16.2  RAMM

There are techniques for combining random number generators to
produce degrees of randomness higher than either operating
alone. One method, proposed by MacLaren and Marsaglia [1965],
is to let one random generator shuffle the output of a second
random generator. This is done in RAMM(N) below which will
behave like RANDOM(N) except that its statistics will be
better. It uses a Knuth generator to shuffle the output of
RANDOM.


         DEFINE('RAMM(N)K')
*
*  The following two OPSYN's make the subroutine plug-in-able
*  to any routine already using RANDOM.
*
         OPSYN('RANDOM.','RANDOM')
         OPSYN('RANDOM','RAMM')
*
*  Initialize the RAMM array (RAMM_A) with random numbers
*  obtained from RANDOM.().
*
         I  =  0
         RAMM_A  =  ARRAY('0:99')
RAMM_1   RAMM_A<I>  =  RANDOM.(0)                  :F(RAMM_END)
         I  =  I + 1                               :(RAMM_1)
*
*  Entry point: Select an element K of RAMM_A at random.
*  Return this value and fill up the entry with a new RANDOM
*  value.
*
RAMM     RAM_VAR  =  REMDR(RAM_VAR * 3141 + 110795, 524288)
         K  =  CONVERT((RAM_VAR / 524288.) * 100,'INTEGER')
         RAMM  =  RAMM_A<K>
         RAMM_A<K>  =  RANDOM.(0)
         RAMM  =  NE(N,0) CONVERT(RAMM * N,'INTEGER') + 1
+                                                  :(RETURN)
RAMM_END

Names referenced     Name       Type        Where defined
by RAMM:             RANDOM *   Function    Program 16.1

* indicates name is referenced in the initialization section.
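
The shuffling idea carries over directly to other languages. The Python sketch below follows the structure of RAMM; the class name `ShuffledRandom` and its method names are inventions of this example, and the initial shuffler state of 0 is an assumption corresponding to an unset RAM_VAR.

```python
class ShuffledRandom:
    """MacLaren-Marsaglia style shuffle of one generator by another (sketch)."""

    def __init__(self):
        self.ran_var = 1              # state of the generator being shuffled
        self.ram_var = 0              # state of the shuffling generator
        self.table = [self._base() for _ in range(100)]

    def _base(self):
        """Real in [0, 1) from R' = R * 4676 (Mod 414971)."""
        self.ran_var = (self.ran_var * 4676) % 414971
        return self.ran_var / 414971

    def random(self, n=0):
        self.ram_var = (self.ram_var * 3141 + 110795) % 524288
        k = int(self.ram_var / 524288 * 100)      # table slot 0..99
        value = self.table[k]
        self.table[k] = self._base()              # refill the slot just used
        return int(value * n) + 1 if n != 0 else value
```

The values delivered are exactly those of the base generator, only in a different order, which is why RAMM can be substituted for RANDOM without changing the range of results.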


Program 16.3  RPERMUTE

A natural application of a random number generator is to
produce random permutations. This is easy to do in SNOBOL4.
RPERMUTE(S) will return a random permutation of the string S.


         DEFINE('RPERMUTE(S)T')                    :(RPERMUTE_END)
RPERMUTE S LEN(1) . T  =                           :F(RETURN)
         RPERMUTE POS(RANDOM(SIZE(RPERMUTE) + 1) - 1)
+                  =  T                            :(RPERMUTE)
RPERMUTE_END

Names referenced     Name       Type        Where defined
by RPERMUTE:         RANDOM     Function    Program 16.1
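
The same insertion scheme can be written in Python. In this sketch the parameter `rand_int` is a hypothetical stand-in for RANDOM(N), returning an integer from 1 to n; by default it draws on Python's own `random` module rather than on Program 16.1.

```python
import random

def rpermute(s, rand_int=lambda n: random.randint(1, n)):
    """Remove characters from the front of s and insert each at a random spot."""
    result = ""
    for ch in s:
        pos = rand_int(len(result) + 1) - 1       # 0 .. len(result)
        result = result[:pos] + ch + result[pos:]
    return result
```

Passing a deterministic `rand_int` makes the mechanism visible: always choosing 1 inserts every character at the front and so reverses the string, while always choosing n appends and leaves it unchanged.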
Program 16.4  ONEWAY

A one-way cipher is a notion of Needham first introduced in
published form by Wilkes [1972]. The function ONEWAY(S) where
S is some string will return a string the same size as S
having the property that it would be computationally
prohibitive to compute S or some other value S' such that:


         ONEWAY(S) = ONEWAY(S')


That is, even knowing everything about ONEWAY to the extent of
having a listing of ONEWAY in front of you, it is still
impractical to compute the original argument from the output
obtained.


One-way ciphers are used in password protection schemes as 
follows. A user types in his password S. The system applies 
ONEWAY(S) to obtain a cipher C. C is then looked up in a
table. If a match is found the user is identified and ap- 
propriate privileges are assumed. This protects against 
accidental or malicious revelation of the table's contents. 
That is, if one, or even all, such ciphers were revealed it 
would not help a thief. He must know the original password or 
any password that would yield the same cipher as the original, 
but this he presumably cannot obtain. 


Without such a protection scheme, the collection of passwords
is always in jeopardy. In one instance, the message of the
day for a time-sharing system that will go nameless became,
quite by accident, the list of passwords. As one wag put it,
the most confidential file in the system suddenly became the
most public file.


Other applications of ONEWAY are indicated in the chapter on 
games. 


*  ONEWAY(S) will return a one-way cipher of the alphabetic
*  string S.
*
         DEFINE('ONEWAY(S)A,SIZE,C,K,SB')          :(ONEWAY_END)
*
*  Entry point: Initialize the random number generator (by
*  setting RAN_VAR) and set the alphabet A. The length of A
*  must be a power (PWR) of 2.
*
ONEWAY   RAN_VAR  =  1
         A  =  'ABCDEFGHIJKLMNOPQRSTUVWXYZ012345'
         PWR  =  5
*
*  Now, for each character (C) within (S) determine its
*  position (K) in the alphabet (A). Obtain K's binary
*  equivalent and append it to the growing string of bits, SB.
*  Also, use K to modify the 'seed' of the random generator.
*
ONEWAY_1 S LEN(1) . C  =                           :F(ONEWAY_2)
         A @K C                                    :F(ERROR)
         SB  =  SB LPAD(BASEB(K,2),PWR,'0')
         RAN_VAR  =  REMDR(RAN_VAR * 2 ** PWR + K, 414971)
+                                                  :(ONEWAY_1)
*
*  Now we replace each '0' by a '01' and each '1' by a '10',
*  randomly permute the string, and extract the first half of
*  it.
*
ONEWAY_2 RPERMUTE(BLEND(SB, REPLACE(SB,'01','10')))
+                  LEN(SIZE(SB)) . SB
*
*  Now repack the string from its 1-0 form into something
*  more amenable.
*
ONEWAY_3 SB LEN(PWR) . S  =                        :F(RETURN)
         A POS(BASE10(S,2)) LEN(1) . C
         ONEWAY  =  ONEWAY C                       :(ONEWAY_3)
ONEWAY_END
Names referenced     Name       Type        Where defined
by ONEWAY:           LPAD       Function    Program 3.2
                     BASEB      Function    Program 2.4
                     RPERMUTE   Function    Program 16.3
                     BASE10     Function    Program 2.5
                     BLEND      Function    Program 3.7

Epilogue


How difficult is it to break the cipher? No one knows. There
is no guarantee that someone will not come up with an
algorithm to quickly find the inverse of ONEWAY; it is just
not very likely.


Essentially the initial argument regarded as a bit string is
both used to 'seed' a random generator and is permuted by the
generator. The straightforward way of cracking the cipher is
to assume a final value for the generator and work RPERMUTE in
reverse by running RANDOM in reverse. If the results are found
to agree, the cipher is cracked. This points up a weakness of
ONEWAY as presented here. We normally wish the number of
guesses required to be of the order of the number of
combinations of the original string. If this were the case,
longer passwords would prove to be more difficult to discover.
But the number of different modes of operation for RANDOM is
relatively small (414970). Hence, if added security is wanted,
a generator with a longer cycle time (such as RAMM) should be
used. Even so, the computation required to permute a half
million strings in the manner indicated is sufficiently
formidable that the writer is confident that no one will
discover the original string used to produce:


         'BFDDGL'


Of course, other techniques can be used to produce one-way
ciphers. See Evans, et al [1974] and Purdy [1974].
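
The overall construction of ONEWAY (seed from the argument, double the bit string, permute, keep half, repack) can be sketched in Python as follows. This is an interpretation rather than a transcription: BLEND is assumed to interleave its two arguments, as the listing's comment about '01' and '10' suggests, and the output of this sketch has not been checked against the book's program.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ012345"   # 32 symbols = 2**5
PWR = 5
MODULUS, MULTIPLIER = 414971, 4676

def oneway(s):
    """Illustrative one-way cipher in the style of Program 16.4."""
    state = 1
    bits = ""
    for ch in s:
        k = ALPHABET.index(ch)                   # ValueError if ch is not allowed
        bits += format(k, "0{}b".format(PWR))    # K as PWR binary digits
        state = (state * 2 ** PWR + k) % MODULUS # fold K into the seed
    # Each '0' becomes '01' and each '1' becomes '10'.
    doubled = "".join(b + ("1" if b == "0" else "0") for b in bits)
    permuted = ""
    for b in doubled:                            # insertion shuffle, as in RPERMUTE
        state = (state * MULTIPLIER) % MODULUS
        pos = int(state / MODULUS * (len(permuted) + 1))
        permuted = permuted[:pos] + b + permuted[pos:]
    half = permuted[:len(bits)]                  # keep the first half only
    return "".join(ALPHABET[int(half[i:i + PWR], 2)]
                   for i in range(0, len(half), PWR))
```

Discarding half of the permuted bits is what prevents the argument from being read back directly; the result always has the same length as the argument.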


Program 16.5  RCHAR

RCHAR(CONTEXT) will return a random character. The intended
sample space is the set of all characters following the
CONTEXT provided as argument. For example, RCHAR('BR') will
return 'A' much more frequently than, say, 'B' because 'A' is
much more likely to follow the characters 'BR'.


In order to write RCHAR we could pump it full of statistical 
information concerning the English language. A more flexible 
(and easier) approach is to let the user supply his own 
language sample (called the corpus) and use pattern matching 
to search for a likely subsequent character. In this way we 
do not limit ourselves to English nor, indeed, even to natural 
languages. 


To obtain a likely successor to, say, 'BR' within a language
corpus, we may look up each occurrence of 'BR' and choose
randomly from among the successors. Another approach is,
starting at some random point within the string, to scan for
the first occurrence of 'BR' and then return the character
which follows. This latter technique is much faster than the
former, but will produce statistically incorrect results.
Thus, if the corpus is 1000 characters long, and if 'BR'
occurs three times in positions 500, 510 and 910, then the
random probe and forward scan would mean that the 500 or the
910 would be picked up relatively frequently, but that the 510
would have an extremely small chance of being selected.
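
The size of the bias is easy to tabulate. The sketch below (function name assumed) tries every possible starting point once and records which occurrence a forward scan, wrapping to the beginning of the corpus if necessary, would find first.

```python
def probe_counts(corpus_length, positions):
    """For each occurrence, count the starting points that lead to it."""
    ordered = sorted(positions)
    counts = {p: 0 for p in ordered}
    for start in range(corpus_length):
        # First occurrence at or after the start; otherwise wrap around.
        hit = next((p for p in ordered if p >= start), ordered[0])
        counts[hit] += 1
    return counts
```

For the example in the text, `probe_counts(1000, [500, 510, 910])` gives 590 starting points for position 500, 400 for position 910 and only 10 for position 510.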


A compromise between these two choices is to scan the string
for the first K instances of the CONTEXT and to choose a
random character from among the K characters which followed.
This greatly reduces the time required to process CONTEXT's
which occur frequently, such as RCHAR('E'), while maintaining
good statistics for other kinds of CONTEXT's. The encoding of
RCHAR given below will use a compromising value for K of 2.


*
*  RCHAR will return a random character following the CONTEXT
*  given as argument. If none such exists, RCHAR will fail.
*
         DEFINE('RCHAR(CONTEXT)BX,C,P,N,RC1')
*
*  Initialization: Read into R_CORPUS the language corpus on
*  which the statistical characteristics of RCHAR will be
*  based.
*
RCHAR_1  X  =  TRIM(INPUT)                        :F(RCHAR_END)
         IDENT(X,'END')                           :S(RCHAR_END)
         R_CORPUS  =  R_CORPUS X ' '              :(RCHAR_1)
*
*  Entry point: Prepare in P a pattern suitable for scanning
*  the text beginning at cursor position N looking for
*  CONTEXT. BREAKX is used to make the scan rapid.
*
RCHAR    CONTEXT  LEN(1) . C                      :F(RCHAR_2)
         BX  =  BREAKX(C)
RCHAR_2  P  =  POS(0) TAB(*N) BX @N CONTEXT LEN(1) . RCHAR
*
*  Pick up the first random character fitting the context.
*  Scanning begins at some arbitrary point N.
*
         N  =  RANDOM(SIZE(R_CORPUS)) - 1
         R_CORPUS  P                              :S(RCHAR_3)
         N  =  0
         R_CORPUS  P                              :F(FRETURN)
*
*  Here to pick up the next adjacent random character. The
*  first is saved in RC1.
*
RCHAR_3  N  =  N + 1
         RC1  =  RCHAR
         R_CORPUS  P                              :S(RCHAR_4)
         N  =  0
         R_CORPUS  P
*
*  Here to select from between these two.
*
RCHAR_4  RCHAR  =  EQ(RANDOM(2),1)  RC1           :(RETURN)
RCHAR_END

Names referenced    Name      Type       Where defined
by RCHAR:           RANDOM    Function   Program 16.1
                    BREAKX    Function   Program 8.2
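For readers who wish to experiment outside SNOBOL4, the
compromise scan can be sketched in Python as follows. This is
only an approximation of RCHAR under stated assumptions: the
names `rchar` and `_scan` and the `rng` parameter are
hypothetical, the corpus is passed as an argument instead of
being read during initialization, and failure is reported by
returning None.

```python
import random


def _scan(corpus, context, start):
    """Index of the first occurrence of context at or after start that
    is followed by a character, wrapping once to the beginning; -1 if none."""
    for begin in (start, 0):
        hit = corpus.find(context, begin)
        if hit != -1 and hit + len(context) < len(corpus):
            return hit
    return -1


def rchar(corpus, context, k=2, rng=random):
    """Choose at random among the characters following the first k
    occurrences of context found from a random starting point."""
    if not corpus:
        return None
    start = rng.randrange(len(corpus))      # arbitrary probe point
    followers = []
    for _ in range(k):
        hit = _scan(corpus, context, start)
        if hit == -1:
            return None                     # context never occurs: fail
        followers.append(corpus[hit + len(context)])
        start = hit + 1                     # resume just past this occurrence
    return rng.choice(followers)
```

A call such as `rchar("ABABAC", "AB")` can only return 'A',
since both occurrences of 'AB' are followed by that character.
Raising k moves the statistics toward the exhaustive method at
the cost of longer scans.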
Program 16.6  RWORD

RWORD is an obvious application of RCHAR. RWORD(K) will return
a random word with characteristics similar to other words in
the given corpus. K is a small whole number indicating the
extent to which context is used in forming the result. That
is, the next character chosen depends on at most the last K
characters already chosen. Selection begins with RWORD
'seeded' with a blank.


Table 16.3  Below is a list of random names produced by
RWORD(K) from a list of 700 names (R_CORPUS in RCHAR). Words
chosen were in the range of 5 - 10 characters but were
otherwise not pre-selected.

   K = 0         K = 1         K = 2         K = 3
   ----------    ----------    ----------    ----------
   Rnztn         Faundobr      Joher         Alton
   Eebfer        Einakicl      Thelmsti      Vigan
   Uoaer         Kolin         Gringtock     Young
   Earlho        Fssmched      Clouth        Rosen
   Meeofr        Paubin        Mcdorg        Haekstra
   Asnegrmnmh    Mormer        Jordawm       Repsherty
   Ckwaig        Feymet        Paudelly      Haekstraun
   Kninhaaf      Madicos       Franic        Walton
   Agajfoope     Halitun       Cloobs        Bartoliti
   Hfhclunc      Mchoskyr      Panscher      Thatchek
   Usirollbh     Ralmrollan    Thaman        Caseyman
   EEdhmeucc     Ffrrr         Mowski        Walker
   Lasdctn       Linestz       Spaglema      Lopiparo
   Ghsiafee      Reawstz       Loobs         Shallisi
   Riesl         Gelllar       Eiter         Ruscher


Table 16.3 contains a number of random words generated by
RWORD when RWORD was given a corpus of 700 surnames culled
from an addressing list. One can see clearly the effects of
increasing K as well as the influence of the type of corpus
chosen. The names for K=2, for example, would be quite
acceptable in outer galactic society. RWORD, using a
different corpus, could be used for brand-name generation.
The name EXXON was purportedly chosen in this way.

         DEFINE('RWORD(K)CONTEXT')                :(RWORD_END)
*
*  Entry point: Initialize RWORD with a blank.
*
RWORD    RWORD  =  ' '
*
*  Use the last K characters of RWORD (or all of RWORD if it
*  fails to contain K characters) as context for the next
*  character.
*
RWORD_1  CONTEXT  =  RWORD
         RWORD  RTAB(K)  REM . CONTEXT
         C  =  RCHAR(CONTEXT)                     :F(RETURN)
         RWORD  =  DIFFER(C,' ')  RWORD C         :S(RWORD_1)
*
*  Falling through means we encountered a blank. Remove the
*  initial blank from RWORD. If RWORD is null, try again.
*
         RWORD  ' '  =
         IDENT(RWORD)                             :S(RWORD)F(RETURN)
RWORD_END

Names referenced    Name      Type       Where defined
by RWORD:           RCHAR     Function   Program 16.5
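The word-building loop translates readily to other languages.
The sketch below is a Python rendering under simplifying
assumptions: the name `rword` and its parameters are
hypothetical, each successor is drawn uniformly from all
occurrences of the context (the exhaustive method, not the
K-instance compromise of RCHAR), and a length cap `max_len` is
added so that the loop is certain to end.

```python
import random


def rword(corpus, k, rng=random, max_len=40):
    """Grow a word one character at a time; each character is drawn from
    those that follow the last k characters somewhere in corpus."""
    word = " "                                   # seed with a blank
    for _ in range(10 * max_len):                # hard bound on the loop
        context = word[-k:] if k > 0 else ""     # whole word if shorter than k
        followers = [corpus[i + len(context)]
                     for i in range(len(corpus) - len(context))
                     if corpus.startswith(context, i)]
        if not followers:
            break
        c = rng.choice(followers)
        if c != " ":
            word += c
            if len(word) > max_len:
                break
        elif len(word) > 1:                      # a blank ends a non-empty word
            break
    return word[1:]                              # drop the seed blank
```

With the tiny corpus `" ab ab ab "` and k = 1 the only word that
can be produced is 'ab'; with k = 0 context is ignored and the
letters come out in arbitrary order, as in the first column of
Table 16.3.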
Program 16.7  RSELECT

RSELECT will make a random selection of one of a sequence of
strings passed to RSELECT as argument. The first character is
taken to be a break character (BC) separating strings in the
sequence. Thus, RSELECT('|A|BIG|CAT') will return each of 'A',
'BIG' and 'CAT' with probability one-third. An optional
integer weight enclosed in sharp signs may be placed at the
beginning of any alternative. Thus,

     RSELECT('|A|#3#BIG|CAT')

will select 'BIG' three times out of five.

RSELECT will be used as a utility routine by several programs
which follow.
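The selection rule can be sketched in Python as follows. The
names `parse_alternates` and `rselect` are hypothetical, and
the sketch assumes, as the uses of RANDOM in this chapter
suggest, that a random integer between 1 and the total weight
is drawn and compared against running subtotals. Weights are
assumed to be positive integers.

```python
import random
import re


def parse_alternates(spec):
    """Split '|A|#3#BIG|CAT' into [('A', 1), ('BIG', 3), ('CAT', 1)].
    The first character of spec is the break character."""
    if not spec:
        return []
    bc, body = spec[0], spec[1:]
    pairs = []
    for alt in body.split(bc):
        m = re.match(r"#(\d+)#", alt)            # optional leading #n# weight
        weight = int(m.group(1)) if m else 1
        pairs.append((alt[m.end():] if m else alt, weight))
    return pairs


def rselect(spec, rng=random):
    """Return one alternative, chosen with probability weight / total."""
    pairs = parse_alternates(spec)
    if not pairs:
        return None
    total = sum(weight for _, weight in pairs)
    i = rng.randint(1, total)                    # assumed analogue of RANDOM(total)
    running = 0
    for alt, weight in pairs:
        running += weight
        if i <= running:
            return alt
```

For the argument '|A|#3#BIG|CAT' the subtotals are 1, 4 and 5,
so the values 2, 3 and 4 of the random integer all select
'BIG', which gives the stated three chances in five.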


         DEFINE('RSELECT(S)WT,WTS,ALT,CODE,I,SSAVED,BC')
         RSEL_TBL  =  TABLE()                     :(RSELECT_END)
*
*  Entry point: All previously-seen arguments had been placed
*  in a table (RSEL_TBL) together with code to be executed.
*  In this case we simply execute the code.
*
RSELECT  CODE  =  RSEL_TBL<S>
         DIFFER(CODE,NULL)                        :S<CODE>
*
*  If S had not been seen before, we fall through here. We
*  first save the string (SSAVED) and determine the break
*  character (BC). For each alternate (ALT), its weight (WT)
*  is determined and added to a subtotal (WTS). CODE is
*  produced which will assign the alternative to RSELECT if
*  the numbers are right.
*
         SSAVED  =  S
         S  LEN(1) . BC  =                        :F(RETURN)
RSELECT_1
         WT  =  1
         S  POS(0) '#' BREAK('#') . WT '#'  =
         S  (BREAK(BC) | REM) . ALT  =
         WTS  =  WTS + WT
         CODE  =  CODE ' ; RSELECT = LE(I,' WTS ') '
+           QUOTE(ALT) ' :S(RETURN)'
         S  BC  =                                 :S(RSELECT_1)
*
*  Falling through means we're done. We simply prefix the
*  code to assign a random number to I, fill the table and
*  try again.
*
         CODE  =  ' I = RANDOM(' WTS ') ' CODE
         S  =  SSAVED
         RSEL_TBL<S>  =  CODE(CODE)               :S(RSELECT)F(ERROR)
RSELECT_END

Names referenced    Name      Type       Where defined
by RSELECT:         QUOTE     Function   Program 3.16
                    RANDOM    Function   Program 16.1
Epilogue

An interesting implementation aspect of RSELECT is that it
compiles code the first time through for any given argument.
This makes sense for a random generator since it may be called
many times with the same argument and compiling code, as shown
here, greatly increases the speed of subsequent calls.
Moreover, the program is not made very much more complicated
because of this; in fact, the construction of CODE actually
saves a second pass over the string and in this sense serves
to produce a simpler program. If space is a greater
consideration than time, see Exercise 16.5.


Program 16.8  RSENTENCE

RSENTENCE(ARG) will generate and return a random sentence
according to a grammatical description read in during
initialization. The argument ARG represents a string possibly
containing syntactic variables which are expanded according
to the grammar. As a simple example, let the input be

<SENT>::=the <NOUN> <VERB>s the <NOUN>
<NOUN>::=boy|man|dog|<NOUN> who <VERB>s the <NOUN>
<VERB>::=bite|walk|pet|lick|smack
END

Then a call such as RSENTENCE('<SENT>.') will generate, among
an infinite number of sentences,

the dog bites the man.
the man walks the dog.
the man who walks the dog who licks the boy smacks the boy
who bites the dog.

Identifiers in pointed brackets (here shown in uppercase for
ease of distinction) are termed syntactic variables.
Alternates are separated by a vertical bar (|). Though these
special characters may not appear within the text, it is not
difficult to provide an escape convention so that they can
(see Exercise 16.9).


When a syntactic variable is expanded it is replaced by one of 
its alternates randomly and this alternate may in turn contain 
other syntactic variables which are also expanded. This 
process may never halt (see the Epilogue). 
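The expansion process can be sketched in a few lines of
Python. This is a simplified illustration, not a translation
of RSENTENCE: the grammar is held in a dictionary named
`GRAMMAR` (a hypothetical name) instead of being read from
input, the leftmost variable is expanded first, and the first
rule is written with `<VERB>s`, as the sample sentences
imply. Weights, embedded expressions and assignments are
omitted.

```python
import random
import re

GRAMMAR = {
    "SENT": ["the <NOUN> <VERB>s the <NOUN>"],
    "NOUN": ["boy", "man", "dog", "<NOUN> who <VERB>s the <NOUN>"],
    "VERB": ["bite", "walk", "pet", "lick", "smack"],
}

_VARIABLE = re.compile(r"<([A-Z]+)>")


def rsentence(text, grammar=GRAMMAR, rng=random):
    """Replace the leftmost syntactic variable by a random alternate
    until no variables remain."""
    while True:
        m = _VARIABLE.search(text)
        if m is None:
            return text
        choice = rng.choice(grammar[m.group(1)])
        text = text[:m.start()] + choice + text[m.end():]


print(rsentence("<SENT>."))
```

With this grammar three of the four <NOUN> alternates contain
no further variables, so the loop ends with probability 1;
grammars for which that is not so are taken up in the
Epilogue.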


The meta-language used for describing the grammar is the so- 
called Backus Normal Form (BNF) which is also referred to as 
Backus-Naur Form since the form is not normal (is not unique) 
and since Naur was a cohort of Backus. The meta-language is a 
bit awkward (the first four meta-characters are redundant 
provided syntactic variables do not contain ='s) but has the 
convenient property of being commonly understood. 


Another feature of RSENTENCE is that an expression in
parentheses is treated as a SNOBOL4 expression. It is
evaluated and inserted into the text stream. Also, an
identifier between ='s is expanded like a syntactic variable
but will also have the side-effect of assigning the result of
the expansion to the indicated variable. Thus

<THING>::=rose|tree|turkey
<SENT1>::= A =THING= is a (THING) is a (THING).
<SENT2>::= The word '=THING=' has (SIZE(THING)) letters.

will produce for <SENT1>:
A rose is a rose is a rose.
with probability one-third. An example of <SENT2> is
The word 'turkey' has 6 letters.

Other miscellaneous features of the program are as follows.
Continuation is represented by a line not beginning with a
'<'. Weights can be associated with alternatives using the
#n# notation of RSELECT.


One application of RSENTENCE is test-data generation for
compilers and other processors expecting stylized input (an
early version of RSENTENCE was used to find bugs in SNOBOL4
itself). Another application is in producing nonrepetitive
messages in an interactive environment. For example, in game
playing, a variety of sarcastic remarks can provoke an
otherwise apathetic player into a competitive state.
RSENTENCE has been used in the production of prospective
topics for a discussion group. While not all topics randomly
generated are directly usable, they are often sufficiently
suggestive and sufficiently numerous that random generation
followed by a culling process, such as the previously
described brand-name selection, becomes an effective
technique.

Yngve [1962a] suggests that such programs, coupled with a
full and valid grammar, solve one aspect of the problem of
machine translation, viz. the target-language generation end.
One must realize, however, that RSENTENCE, by itself, is
limited almost exclusively to context-free generations and
hence to very restrictive grammars. To aid in the machine
translation study, RSENTENCE must be considerably enhanced.
One such enhancement, suggested by Yngve, is given in
Exercise 16.8. It must also be realized that it is not
sufficient merely to generate sentences having a variety of
syntactic constructs; one must actually be able to perform
transformations from one form into another. This is
considered more fully in RSTORY (Prog. 16.11).


         DEFINE('RSENTENCE(STACK)VAR,EXP,S,TEXT')
*
*  Pattern initialization:
*
         SYN.VAR  =  POS(0) '<' ARB . VAR '>'
         SNOBAL.EXP  =  POS(0) '(' BAL('(<>)','"' "'") . EXP ')'
         ASGN.VAR  =  POS(0) '=' ARB . VAR '='
         LITERAL.TEXT  =  BREAK('<=(') . TEXT
*
*  Read in the grammar and enter the alternative lists into a
*  table (RSENT_TBL).
*
         RSENT_TBL  =  TABLE()
         SS  =  TRIM(INPUT)
RSI_1    S  =  TRIM(INPUT)
         S  POS(0) ('<' | 'END' RPOS(0))          :S(RSI_2)
         SS  =  SS S                              :(RSI_1)
RSI_2    SS  '<' ARB . NM '>::='  =
         RSENT_TBL<NM>  =  '|' SS
         IDENT(S,'END')                           :S(RSENTENCE_END)
         SS  =  S                                 :(RSI_1)
*
*  Entry point: The string named STACK will contain all not-
*  yet processed information. The string S will contain the
*  random sentence being formed. We examine the STACK for a
*  syntactic variable, a SNOBOL4 expression in parentheses,
*  an assignment operation enclosed in ='s, or, if none of
*  these, arbitrary text.
*
RSENTENCE
         STACK  SYN.VAR  =  RSELECT(RSENT_TBL<VAR>)  :S(RSENTENCE)
         STACK  SNOBAL.EXP  =                     :F(RSENT_1)
         S  =  S EVAL(EXP)                        :(RSENTENCE)
RSENT_1  STACK  ASGN.VAR  =                       :F(RSENT_2)
         $VAR  =  RSENTENCE('<' VAR '>')
         S  =  S $VAR                             :(RSENTENCE)
RSENT_2  STACK  LITERAL.TEXT  =                   :F(RSENT_3)
         S  =  S TEXT                             :(RSENTENCE)
RSENT_3  RSENTENCE  =  S STACK                    :(RETURN)
RSENTENCE_END

Names referenced    Name       Type       Where defined
by RSENTENCE:       BAL *      Function   Program 8.3
                    RSELECT    Function   Program 16.7

* indicates name is referenced in the initialization section.


Epilogue

A curiosity of sentence generators such as RSENTENCE is that
it is possible to write a grammar with a chance of looping
forever. Pohl [1967] gives the following examples:

<S1>::= A | B <S1>
<S2>::= A | <S2> A <S2> | <S2> B <S2>
<S3>::=#2# A | <S3> A <S3> | <S3> B <S3>

Whereas <S1> will always halt, <S2> has only a probability of
1/2 of halting (unlike normal loops, the program will not
actually run forever because storage requirements will
ultimately be exceeded; in practice, however, the program will
appear to be looping because the storage growth rate is
small). <S3> represents a 'fixed-up' version of <S2> which,
like <S1>, will halt with probability 1.

The analysis of this phenomenon is based on the notion of
random walks with ruin and is treated in detail by Feller
[1957]. Let a particle on each step move either to the left
or to the right. Let it move to the left with probability p
and to the right with probability q so that p + q = 1. Let P
be the probability of ever moving one step to the left. Then
P**n is the probability of ever moving n steps to the left.
Hence

     P = p + q P**2

since the particle either moves left at once (probability p)
or moves right (probability q) and must then ever move two
steps to the left.

This equation has exactly two solutions, viz. P = 1 and P =
p/q. Curiously, the correct choice does not seem to be
deducible by a simple argument. It happens to be 1 if p >= q
and is p/q if p < q. The dividing line of p = q = 1/2 is of
interest in that the walk is certain to ultimately reach any
point but the expected waiting time is infinite.

In the examples above, <S2> loops because, effectively, q =
2/3 and p = 1/3. On the other hand <S3> has p = 1/2 and q =
1/2 and so the probability of halting is 1 (but just barely).
In <S1>, we may throw out any alternation that leads to the
same state so that, effectively, p = 1 and q = 0.
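The choice between the two roots can be checked numerically.
The Python sketch below (the function names are hypothetical)
iterates P = p + q P**2 starting from zero, which amounts to
asking for the probability of halting within a growing number
of expansion levels, and compares the limit with the rule
stated above.

```python
def halting_probability(p, q, iterations=100_000):
    """Limit of the iteration P <- p + q*P**2 started at 0; this is the
    smaller non-negative root of P = p + q*P**2."""
    P = 0.0
    for _ in range(iterations):
        P = p + q * P * P
    return P


def closed_form(p, q):
    """The root selected by the analysis: 1 if p >= q, otherwise p/q."""
    return 1.0 if p >= q else p / q


for p, q in [(1 / 3, 2 / 3), (1 / 2, 1 / 2), (1.0, 0.0)]:
    print(p, q, halting_probability(p, q), closed_form(p, q))
```

For p = 1/3 and q = 2/3 the iteration settles quickly on 1/2.
For p = q = 1/2 it creeps toward 1 very slowly, which is the
numerical counterpart of the infinite expected waiting time.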
Program 16.9  RPOEM

One use (one hesitates to say application) of RSENTENCE is in
poetry generation (see Milic [1970, 1971] for a general
discussion of this topic and other references). For example,
if the following were the input to RSENTENCE:

<PROP>::=action|duration|hunger|feeling|activity|movement|
motion|notion|endurance|tenderness|age|taste|bounty|goodness
<GEN>::=time|nature|age|wisdom|war|peace|power|energy|earth|
love|beauty|charity|faith|hope|thought|strength|night|
piety|heart|land|evil
<SPEC>::=flower|tree|dove|star|cloud|twig|pond|dog|goat|
muffin|petal|wagon wheel|gate|trap|lark|raven|drop|dish|spoon|
spark|bone|brain|tooth|face|rake|shovel|book|cover|whistle
<PREP>::=on|up|over|under|within|beside|of|in
<TVERB>::=revere|worship|understand|beseech|control|provoke|
heal|pursue|strengthen|become|kill|arouse|becalm|ensnare
<IVERB>::=sing|talk|run|aspire|twiddle|think|gurgle|ponder|
wiggle|bend|simmer|bask|break|tumble|dance|whistle|squawk
<ADJ>::=gentle|frail|happy|sorrowful|mournful|gay|rusty|
frolicking|wanton|lustful|timid|pensive|timorous|moody
<AUX>::=may|can|shall|should|must|doth
<NOUN>::=a <ADJ> <SPEC>|a <SPEC> of <GEN>|the <PROP> of a
 <SPEC>|the <SPEC> <PREP> <NOUN>|<GEN> <PREP> <GEN>|<GEN>'s
 <PROP>|<ADJ> <GEN>|the <PROP> of <GEN>
<RPOEM>::=A =ADJ= =SPEC= <AUX> <IVERB> <PREP> =NOUN=/And <AUX>
 <TVERB> <NOUN>./But <NOUN> <TVERB>s <NOUN>/While (NOUN)
 <TVERB>s the (ADJ) (SPEC)./
END

The first four calls to RSENTENCE('<RPOEM>') (with RAN_VAR set
to 1) produce:

     A lustful twig can twiddle up the tenderness of a spoon
     And can kill the motion of wisdom.
     But the brain beside gay power heals the action of earth
     While the tenderness of a spoon heals the lustful twig.

     A happy muffin shall bask under earth of night
     And can ensnare the pond up charity of earth.
     But the activity of charity strengthens sorrowful faith
     While earth of night beseechs the happy muffin.

     A wanton gate may gurgle under the gate of the age of a star
     And should worship a gay shovel.
     But frail wisdom ensnares the endurance of night
     While the gate of the age of a star pursues the wanton gate.

     A moody cloud shall ponder over the motion of a shovel
     And should beseech the goodness of beauty.
     But war over nature worships a wanton goat
     While the motion of a shovel strengthens the moody cloud.

where the lines are broken at slashes. Notice that an effort
was made to produce sentences which would be syntactically
correct and also have some semantic soundness. For example,
there are three types of nouns, GENeral, SPECific and
PROPerty. One of the noun phrases is <PROP> of <SPEC>, i.e. a
property of a specific thing, but <SPEC> of <PROP> is not
allowed.

One reason that the random generation of poems has been
popular is that context-free generators produce very little
semantic connectivity between words. Since the poet is granted
license to break such rules we naturally interpret text in
which such rules are broken as poetry. As Milic [1970] has
observed, we readily "... accept metaphor as an alternative to
calling a sentence nonsensical." Hence, in generating random
text it is much easier to randomly generate 'poetry' than
prose just as it is easier to randomly generate 'abstract art'
than good pictures. One conceivable application of random
poetry is as an initial exercise in a poetry-appreciation
course. The exercise of explaining the 'meanings' of some of
the computer renderings can be a mind-expanding experience.

RSENTENCE may, as we will see, also be used for story
generation. There are, however, definite limitations in this
direction. Mendoza [1968] describes one effort to improve
somewhat on the semantic soundness of the generated
sentences. Essentially his method applied weights to
different noun-verb combinations so that a squirrel would
munch and crunch with a greater likelihood than crawl and
swim. This technique produced sentences which were internally
sound but which had very little relation to other sentences.
Hence, when Mendoza read sets of such sentences to his
children as stories, the children complained because the
stories never got anywhere.

Using a vocabulary heavily sprinkled with chemical terms,
Mendoza reported on attempts to pass off randomly-generated
sentences in a chemistry examination. It is perhaps a plus
for higher education that the teacher not only did not give a
high grade to the computer but actually stormed into the
Director's office shouting "Who the hell is this man, why did
we ever admit him?" Perhaps what is of interest in these
stories is that the individuals involved did not see the
computer behind the gibberish but accepted it as a very bad
human product. This is an advance of sorts. The problem of
providing inter-sentence connectivity is a challenging one
and will be considered after taking up the next topic.
SIMULATION

The computer may be used to simulate real events and, in so
doing, may determine the outcome of certain strategies or
actions far less expensively and more quickly than by
concocting the event physically. Simulation is used where the
events to be predicted are not amenable to mathematical
analysis but where the underlying stochastic structure is
well-established. Simulations are used in business, where
transport networks, factories and shops, trading centers,
etc. may be analyzed, and in the study of warfare, cities,
traffic, demography, biological adaptation and many other
large and complex situations. Simulations are sometimes
referred to as Monte Carlo techniques, but this latter term is
more likely to be reserved for more mathematically-oriented
situations. As a crude example, the area under a curve can be
approximated by generating random number pairs (see Exercise
16.13) and testing to see if they fall above or below the
curve of interest. Other areas where simulations can be used
are game-playing, sports and gambling. For a specific
simulation we choose the game of baseball.
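The area example can be made concrete with a short Python
sketch. The function name and parameters are hypothetical;
the sketch assumes the curve lies between 0 and `y_max` over
the interval from 0 to `x_max`, so that the enclosing
rectangle has a known area.

```python
import random


def area_under_curve(f, x_max, y_max, trials=100_000, rng=random):
    """Estimate the area under f on [0, x_max], assuming 0 <= f(x) <= y_max,
    by counting the random points of the enclosing rectangle that fall
    on or below the curve."""
    below = 0
    for _ in range(trials):
        x = rng.uniform(0.0, x_max)
        y = rng.uniform(0.0, y_max)
        if y <= f(x):
            below += 1
    return x_max * y_max * below / trials


# The area under y = x**2 between 0 and 1 is exactly 1/3.
print(area_under_curve(lambda x: x * x, 1.0, 1.0))
```

The estimate fluctuates from run to run; how quickly such
fluctuations shrink as more points are drawn is the subject of
the error analysis later in this section.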


Program 16.10  RSEASON

The function RSEASON(NG) is intended to simulate a random
season of baseball. The number of games is given by the
argument NG. The value returned is the number of runs scored
in the simulation. The simulation is governed by statistics
read in at initialization time. One example of input that
could be given is shown in Table 16.4.


Table 16.4  Shows the line-up and statistics for the 1927 New
York Yankees. Source is BB [1969]. Only the data shown in
lower center was actually input to RSEASON.

   Player      AB,    H,  DB,  TR,  HR,   BB     Avg
   Ruth       540,  192,  29,   8,  60,  138    .356
   Gehrig     584,  218,  52,  18,  47,  109    .374
   Meusel     516,  174,  47,   9,   8,   45    .337
   Lazzeri    570,  176,  29,   8,  18,   69    .309
   Dugan      387,  104,  24,   3,   2,   27    .269
   Collins    251,   69,   9,   3,   7,   54    .275
   Pitcher    500,   50,  Sp Te -24-      10    .100


Table 16.4 shows the lineup and statistics of the 1927 New
York Yankees, perhaps the most powerful hitting aggregation in
the history of baseball. The statistics given for the pitcher
are not those of any given player but are an estimated
composite of the entire pitching staff.

The program is in a sense the simplest possible simulation
since only offensive data are given for only one team. A
perfect simulation would perhaps require that every blade of
grass be taken into account and is completely out of the
question from the standpoint of human effort, let alone the
fact that baseball records, complete as they are, do not show
all such minutiae. Between these extremes, the pitcher on the
defensive team and to a lesser extent the fielders do affect
the performance of the offensive team as a whole and may
peculiarly affect individual hitters. Another weakness of the
simulation is that every player's performance is independent
of his previous performances and, more severely, of the game
situation. Some players are considered 'clutch hitters' and
pitchers tend to 'bear down' on hitters in tight situations.
All of these factors are worth a study of their own to anyone
interested in a serious simulation of the game. We will be
content with exploring the principles of simulation. As it
stands, however, RSEASON could be used to determine the gross
effects due to line-up changes and permutations in order to
determine optimal line-ups or to evaluate trades, the effect
of pinch hitters, etc.


         DEFINE('RSEASON(GAMES)INNING,RUNS,BASES,OUTS,K')
*
*  A structure, RECORD, is defined to contain the statistics
*  of one player. STATS is an array, filled during the
*  initialization period with statistics of the players in
*  the simulated lineup.
*
         DATA('RECORD(AB,H,DB,TR,HR,BB)')
         STATS  =  ARRAY(9)
         I  =  0
RS_INIT  I  =  I + 1
         STATS<I>  =  EVAL('RECORD(' INPUT ')')   :S(RS_INIT)
                                                  :(RSEASON_END)
*
*  Entry point and outer loop: Control returns here after
*  each complete game. Control arrives at RS_1 for each new
*  inning. BASES will contain the men on base in the form of
*  a string and OUTS is an integer recording the number of
*  outs.
*
RSEASON  GAMES  =  GT(GAMES,0)  GAMES - 1         :F(RETURN)
         BATTER  =  0
RS_1     OUTS  =  0
         BASES  =
*
*  Here for each new batter. His statistics are obtained in
*  S. A random number K is obtained based on his total at-
*  bats. The variable ADV is set according to how his per-
*  formance would advance runners from bases 0, 1, 2, and 3.
*  The actual advancement is done at RS_4. An exception is
*  the walk (BB) in which advancement is context sensitive
*  and so must be treated as a special case at RS_BB.
*
RS_2     BATTER  =  EQ(BATTER,9)  0
         BATTER  =  BATTER + 1
         S  =  STATS<BATTER>
         K  =  RANDOM(AB(S) + BB(S))
         ADV  =  GT(K,AB(S))  '1223'              :S(RS_BB)
         OUTS  =  GT(K,H(S))  OUTS + 1            :S(RS_OUT)
         ADV  =  LE(K,HR(S))  'RRRR'              :S(RS_4)
         ADV  =  LE(K,HR(S) + TR(S))  '3RRR'      :S(RS_4)
         ADV  =  LE(K,HR(S) + TR(S) + DB(S))  '23RR'  :S(RS_4)
         ADV  =  '12RR'
RS_4     BASES  =  REPLACE(BASES 0, '0123', ADV)  :(RS_2)
RS_BB    BASES  '321'  =  'R21'
         BASES  '21'  =  '31'                     :(RS_4)
*
*  If there are now three outs, determine the number of RUNS
*  scored this inning by scanning BASES. Add to total
*  (RSEASON). Then check to see if we've completed 9 INNINGS.
*
RS_OUT   EQ(OUTS,3)                               :F(RS_2)
         RUNS  =  0
         BASES  SPAN('R') @RUNS
         RSEASON  =  RSEASON + RUNS
         INNING  =  INNING + 1  LT(INNING,9)      :S(RS_1)
         INNING  =  0                             :(RSEASON)
RSEASON_END

Names referenced    Name      Type       Where defined
by RSEASON:         RANDOM    Function   Program 16.1
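The character-replacement device used to move the runners is
worth isolating. The Python sketch below is an illustration
under stated assumptions, not a translation of RSEASON: the
function names are hypothetical, `str.translate` stands in for
REPLACE, and the two replacements made before a walk are this
sketch's reading of the special case handled at RS_BB.

```python
SINGLE, DOUBLE, TRIPLE, HOMER, WALK = "12RR", "23RR", "3RRR", "RRRR", "1223"


def advance(bases, adv):
    """Move the batter ('0') and every runner according to adv.

    bases lists the occupied bases from lead runner to trailing runner,
    e.g. '31' for men on third and first; 'R' marks a run already scored.
    adv gives the destinations of the batter and of the runners on bases
    1, 2 and 3, in that order."""
    return (bases + "0").translate(str.maketrans("0123", adv))


def walk(bases):
    """A walk advances only runners who are forced, so patch two cases."""
    bases = bases.replace("321", "R21", 1)   # bases loaded: a run is forced in
    bases = bases.replace("21", "31", 1)     # first and second: lead man to third
    return advance(bases, WALK)


def runs(bases):
    """Number of leading R's, i.e. runs scored so far this inning."""
    return len(bases) - len(bases.lstrip("R"))


print(advance("31", SINGLE))   # R21
print(walk("321"))             # R321
```

Because runners are kept in order from the lead man back, runs
always collect at the front of the string and can be counted
by a single scan.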


One of the most important aspects of a simulation is how to
interpret the numbers. For example, to simulate a season we
may call RSEASON(154) and find that 978 runs were scored. But
repeated calls to RSEASON(154) will produce slightly different
numbers. An actual sequence obtained was:

     978 1013 1068 1004 886 999 1053 1039.

These eight numbers average to 1005. In general, the more
numbers we obtain the closer their average approaches some
limiting value. Since computation can be expensive and
time-consuming, we may well ask how far we must pursue the
statistic-gathering before the average settles down to
something reasonable. Said another way, how can we estimate
the error of such a computed average?

Let M be the mean of n numbers X1, X2, ..., Xn. That is

     M = (X1 + X2 + ... + Xn) / n                       (16.1)

It is well known [Feller 1957] that if the X1, X2, ..., Xn are
independent then no matter what their distribution (assuming
their means and variances are not infinite), their sum S

     S = X1 + X2 + ... + Xn

approaches a Gaussian distribution whose standard deviation
(or standard error) E can easily be estimated from the
formula:

     E**2 = (X1 - M)**2 + (X2 - M)**2 + ... + (Xn - M)**2   (16.2)

The sum S will be in error by about E. Moreover, we may be
95% confident that S is within ± 2E of the average value.
Hence we may with the same confidence (95%) expect that the
asymptotic average will be in the range:

     S/n ± 2E/n

As an example, given the previous 8 numbers, we obtain

     E**2 = 729 + 64 + 3969 + 1 + 14161 + 36 + 2304 + 1156
          = 22420
     E    = 150
     S/n ± 2E/n = 1005 ± 37.5

For long sequences of numbers, (16.2) is not in the most
convenient form, since the mean M is not available until the
last number Xn is seen. Rewriting (16.2) using (16.1) we
obtain:

     E**2 = (X1**2 + X2**2 + ... + Xn**2) - n M**2         (16.3)

Note that E**2 varies roughly as n and so E/n varies inversely
as the square root of n. Hence in order to reduce our range
of error by a factor of K we must gather K**2 times as many
statistics. Hence, precision is expensive and, for this
reason, simulations are used only when analytical techniques
are not available.
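The arithmetic of the example can be reproduced with a
one-pass calculation based on (16.3). The following Python
sketch uses a hypothetical function name; it keeps only the
count, the sum and the sum of squares while the numbers go by.

```python
import math


def running_interval(stream):
    """Return (mean, half_width), where half_width is 2E/n and E**2 is
    computed in one pass as the sum of squares minus n * mean**2."""
    n = total = total_sq = 0
    for x in stream:
        n += 1
        total += x
        total_sq += x * x
    mean = total / n
    e2 = max(total_sq - n * mean ** 2, 0.0)   # guard against rounding below zero
    return mean, 2.0 * math.sqrt(e2) / n


mean, half = running_interval([978, 1013, 1068, 1004, 886, 999, 1053, 1039])
print(mean, round(half, 1))   # 1005.0 37.4
```

The half-width comes out as 37.4 rather than 37.5 only because
E is not rounded to 150 before dividing. When the numbers are
large and nearly equal, the subtraction in (16.3) can lose
precision, in which case the two-pass form (16.2) is the safer
choice.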


To determine the effect of modifying the batting order,
RSEASON(154) was called 45 times with the lineup as indicated
in Table 16.4 and 45 times with Ruth and the pitcher
interchanged. In the first case the average runs scored per
season was 1009 ± 14 where 14 is the 95% confidence interval.
In the second case the average was 971.5 ± 14. The experiment
clearly shows the efficiency of the given lineup over the
postulated one.

One curiosity remains however. The number of runs the Yankees
actually scored that season was 975. This in spite of the fact
that pinch hitters, clutch hitting, extra-inning games, errors
and better pitcher-hitting than .100 would have made the
actual figure higher than the simulated figure. On the other
hand, the Yanks won 110 games that year. If say 70 were won
at home then they missed one inning out of twenty which would
account for 50 runs. Almost certainly, good clutch pitching,
if not choke hitting, could account for the rest.

Program 16.11  RSTORY

As indicated by Mendoza (Epilogue to RPOEM, Prog. 16.9),
sequences of sentences which bear little coherence one to the
other are not particularly interesting even to children, let
alone the flabbergasted professor. At first sight, the
ability to produce an actual story may seem quite beyond the
state of the computer art. However, it is not essentially
difficult to supply the desired connectivity by using some
underlying simulation to form a developing plot and use the
random sentence generator to supply verbal 'sugaring'. This
is amply illustrated by the baseball simulation (RSEASON)
which would be quite easy to modify to produce a 'meat and
potatoes' narration such as: "... Ruth makes out, Gehrig hits
single, Meusel makes out, End of inning, no runs ... ", etc.
For the purpose of story-generation, descriptive phrases
chosen at random could further embellish the tale, adding
needed color (see Exercise 16.16).

For the generation of stories which may appeal to children, a
child's game may be simulated. There are many games on the
market in which tokens moving over a board carry the child
through a sequence of adventures, often with a competitive
element thrown in which would make the story interesting.
Board games, such as Monopoly, have been programmed and most
children's games are considerably less complicated than this.

One method of producing random stories which vary only weakly
from each other is to locally perturb certain variables of a
given pre-concocted story. There are children's books on the
market which utilize this principle in producing personalized
books. In addition to using this principle, RSTORY, below,
attempts to utilize a collection of semantically rich (or at
least richer) information of the form <agent> <adversely
operates upon> <agent>. RSTORY draws upon these relationships
in order to produce the simple 'actor-action' chain which this
classic children's story requires.
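The actor-action chain can be pictured with a small Python
sketch. Everything in it is illustrative: the function names
are hypothetical, the three sample phrases are invented for
the example and are not the program's input data, and the
sketch ignores the random sentence generation that supplies
the surrounding narrative.

```python
import random


def build_actions(phrases):
    """Map each object to the (subject, verb) pairs that act upon it."""
    actions = {}
    for subj, verb, objs in phrases:
        for obj in objs:
            actions.setdefault(obj, []).append((subj, verb))
    return actions


def chain(actions, first, length, rng=random):
    """Starting from first, repeatedly pick an unused agent that
    adversely operates upon the most recently named agent."""
    links, last, used = [], first, {first}
    for _ in range(length):
        candidates = [(s, v) for s, v in actions.get(last, []) if s not in used]
        if not candidates:
            break
        subj, verb = rng.choice(candidates)
        links.append(f"{subj} won't {verb} the {last}")
        used.add(subj)
        last = subj
    return links


# Invented sample phrases of the form SUBJECT VERB OBJECT(s).
PHRASES = [("dog", "bite", ["pig"]),
           ("stick", "beat", ["dog"]),
           ("fire", "burn", ["stick"])]

for line in chain(build_actions(PHRASES), "pig", 3):
    print(line)
```

Each new agent becomes the object of the next request, which
is what gives the resulting story its cumulative form.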


Process phrases - We assume that RSENTENCE has read in all 
syntactic variable definitions. All phrases are of the 
form SUBJECT VERB OBJECT. For each object expressed or 
implied in a phrase, we make an entry in the table ACTIONS 
which will contain the subject and object. 
ee TE | 
ACTIONS = TABLE() 
BB = BREAK(' ‘) 
SB = SPAN(' ‘) 
READ_PHRASE 


X = TRIM(INPUT) : F (BEGIN_STORY) 
IDENT (X, ‘END ') :S (BEGIN_STORY) 
X (BB SB BB) . SUBJ_VERB SB REM . OBJS 
OBJS = OBIS ‘|! 

READ_PH1 : a 
OBJS POS(0) '<' ARB . VAR '>' = RSENT_TBL<VAR> 


OBJS POS (0) as: = 7S (READ_PH1) 


            OBJS BREAK('|') . OBJ '|' =              :F(READ_PHRASE)
            ACTIONS<OBJ> = ACTIONS<OBJ> '|' SUBJ_VERB
+                                                    :(READ_PH1)
*
* The story's setting and the principal characters are
* introduced here.
*


BEGIN_STORY RSTORY = RSENTENCE('<OPENING>')
            LIST = ' ' PET " won't jump over the " BARRIER
            LAST = PET
            &MAXLNGTH = 30000


*
* Find a new agent; we will try ten times to produce a verb
* and an agent that we haven't seen before.
*
NEW_AGENT
            TRY = 0
RETRY       TRY = TRY + 1 LT(TRY, 10)                :F(REQUEST)
            ALTS = ACTIONS<LAST>
            RSENTENCE(RSELECT(ALTS)) BB . SUBJ SB REM . VERB
            RSTORY ' ' SUBJ ' '                      :S(RETRY)
            RSTORY ' ' VERB ' '                      :S(RETRY)


*
* Here the refusal is added to the story as well as descriptive
* text relating to finding a new agent and making a request.
*
REQUEST     RSTORY = RSTORY RSENTENCE('<REFUSAL>')
            LIST = ' ' SUBJ " won't " VERB ' the ' LAST ", " LIST
            LAST = SUBJ


*
* If the agent complies freely with the request, control
* falls through the next test and the story is essentially
* over.
*
            LT(SIZE(LIST), 175)                      :S(NEW_AGENT)
FIN1        LIST "won't" = "began to"                :S(FIN1)
FIN2        LIST ',' = "; the"                       :S(FIN2)
            RSTORY = RSTORY RSENTENCE('<PERSUADED>')


*
* Now output the story.
*
OUT         RSTORY (LEN(50) BB) . OUTPUT SB =        :S(OUT)
            OUTPUT = RSTORY


*
* Below find the input data to the program.  The first half
* (up to END) is processed by RSENTENCE.  Following this we
* find the phrases on which the story is based.
*
END
<OPENING>::=<TIME> there was a =CHAR= who went to <PLACE> and
bought a =PET=. On the way home they came upon a =BARRIER=
which the (PET) was afraid to cross. The (CHAR) said "(PET),
(PET), jump over the (BARRIER) or I won't get home tonight."
<TIME>::=Once upon a time|Once|Long ago in a small village|
In days gone by in a little town by the river
<PLACE>::=market|a pet store|a super market|town|the city
<BARRIER>::=fence|ditch|fallen tree|large rock|stream|brook
<PET>::=dog|cat|parrot|pony
<REFUSAL>::= But the (LAST) would not. The (CHAR)
<EXCURSION> and she met a (SUBJ). She said, "(SUBJ), (SUBJ),
(VERB) (LAST), (LIST) and I shan't get home tonight."
<EXCURSION>::=went down the path|went over a hill|went by
<OBJECT> and then <EXCURSION>|went toward <OBJECT>|
went over hill and dale|went near <OBJECT>|went on the road to
<OBJECT>|went for (RANDOM(20) + 1) miles
<OBJECT>::=the <COLOR> <THING>
<COLOR>::=white|blue|red|yellow|grey|black|dark|green|orange
<THING>::=mill|tavern|church|school|house|meadow|rock|barn
<PERSUADED>::= The (SUBJ) knew the (CHAR) and, in fact,
had been saved by her from a wild <WILD_AN>. So the (LIST)
and the (CHAR) got home that night.
<CHAR>::=little old woman|little old lady|kind grandmother|
kind old aunt|little girl dressed in red|retired seamstress|
nice old lady|little girl green
<DOM_AN>::=cow|pig|horse|sheep|chicken
<WILD_AN>::=lion|giraffe|tiger|camel|ostrich|rhinoceros
<ANIMAL>::=<DOM_AN>|<WILD_AN>|<PET>
<HUMAN>::=farmer|girl|policeman|hunter|man|boy
<A>::=<HUMAN>|<ANIMAL>
<CUT>::=cut|slice|snip|slash
<CUTTER>::=knife|scissor|sword|dagger
<BEE>::=bee|wasp|horse-fly
<HURT>::=bite|frighten|scare|kick|eat
END
<ANIMAL> <HURT> <HUMAN>
<CUTTER> <CUT> <A>
<A> break <CUTTER>
water drown <A>
<A> drink water
fire burn <A>
smoke suffocate <A>
<BEE> sting <A>
<A> swat <BEE>
wind blow-out fire
wind disperse smoke
smoke pollute wind
smoke smother fire
<HUMAN> disperse smoke
<A> spill liquor
liquor intoxicate <A>
<HUMAN> slay <WILD_AN>
<WILD_AN> eat <HUMAN>
END


Names referenced     Name        Type       Where defined
by RSTORY:           RSENTENCE   Function   Program 16.8


Epilogue 


One example of a story produced by the program (untouched by 
human hands) is: 


Long ago in a small village there was a little old
lady who went to a pet store and bought a cat. On
the way home they came upon a ditch which the cat was
afraid to cross. The little old lady said "cat, cat,
jump over the ditch or I won't get home tonight."
But the cat would not. The little old lady went over
hill and dale and she met a water. She said, "water,
water, drown cat, cat won't jump over the ditch and
I shan't get home tonight." But the water would not.
The little old lady went on the road to the red school
and she met a man. She said, "man, man, drink water,
water won't drown the cat, cat won't jump over the
ditch and I shan't get home tonight." But the man
would not. The little old lady went toward the blue
church and she met a lion. She said, "lion, lion,
eat man, man won't drink the water, water won't drown
the cat, cat won't jump over the ditch and I shan't
get home tonight." But the lion would not. The little
old lady went toward the yellow rock and she met a
smoke. She said, "smoke, smoke, suffocate lion, lion
won't eat the man, man won't drink the water, water
won't drown the cat, cat won't jump over the ditch
and I shan't get home tonight." But the smoke would
not. The little old lady went toward the blue house
and she met a girl. She said, "girl, girl, disperse
smoke, smoke won't suffocate the lion, lion won't
eat the man, man won't drink the water, water won't
drown the cat, cat won't jump over the ditch and
I shan't get home tonight." The girl knew the little
old lady and, in fact, had been saved by her from a
wild ostrich. So the girl began to disperse the smoke;
the smoke began to suffocate the lion; the lion began
to eat the man; the man began to drink the water;
the water began to drown the cat; the cat began
to jump over the ditch and the little old lady got
home that night.


The reader will note that the story tends to be repetitious 
which is somewhat the point since small tots have a penchant 
for this sort of thing. 


In order to extend the robustness of the given program (where
robustness is defined as the degree to which the stories vary)
one may, of course, extend the vocabulary. One of the
limitations so encountered is the necessity within English to
observe certain grammatical niceties such as using 'she' to
refer to a woman. This single fact, incidentally, is the reason
that the principal character in the story has feminine gender.
To include any gender, one would at least need a function
PRONOUN(W) which will return the third person singular
personal pronoun for any word given as argument. While this
task is not formidable (with a limited vocabulary), a complete
set of grammatical transformations which would include, for
example, present tense to past and future, active voice to
passive, indicative mood to subjunctive, singular to plural,
represents a considerable undertaking. Thus, with story
generation, as opposed to mere sentence generation, we come to
grips with much more severe syntactic problems.
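For a limited vocabulary, a PRONOUN-like function amounts to a table lookup. The Python sketch below is illustrative only: the word list, the name pronoun and the default of 'it' are assumptions and not part of the book's programs. It scans a character description for the first word with a known gender.

```python
# Hypothetical vocabulary; any word not listed is treated as neuter.
PRONOUNS = {
    "woman": "she", "lady": "she", "girl": "she", "aunt": "she",
    "grandmother": "she", "man": "he", "boy": "he",
}

def pronoun(description):
    """Return the third person singular pronoun for a word or phrase."""
    for word in description.lower().split():
        if word in PRONOUNS:
            return PRONOUNS[word]
    return "it"

for who in ("little old lady", "little girl dressed in red", "boy", "ditch"):
    print(who, "->", pronoun(who))
```

Scanning every word, rather than only the last, lets a phrase such as "little girl dressed in red" resolve correctly. The lookup does nothing for the harder transformations (tense, voice, mood, number) mentioned above.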


The semantic difficulties involved in considerably extending 
the robustness of the story generator are also of interest. 
It should be clear that the vocabulary section of RSTORY can 
be completely overhauled to produce stories in such diverse 
settings as the wild west, interplanetary travel, the Jurassic 
period (dinosaur days), etc. A weakness of the system is that 
one could not place the union of all such information into the 
story since, for example, the <EXCURSION> variable might
produce "the cowboy drove his spaceship past the red
pterodactyl." We should want to at least draw actors and
actions into the story on a logical, though perhaps
probabilistic, basis. The problem seems somewhat similar to
the Analogy Problem [Tuggle 1973] in which a program attempts
to fill in the blank in a sentence of the form


     A is to B as C is to ___


Here, a sufficiently rich data base makes such problems
tractable. Returning to our story, if CHAR is our principal
character and we wish her (him) to travel we may say:


     "cowboy is to horse as CHAR is to ___"


in order to find an appropriate means of transport. We can 
see a bit of this in the specialized data section of RSTORY 
(the second set of data) which sets forth relations between 
individuals and specialized groups to obtain greater realism 
at the expense of robustness. These relations are, of course, 
all of a certain kind, viz. of the form <agent> <affects> 
<agent>. Increasing the kinds of relations is essentially what 
is required to solve the Analogy Problem. Thus, RSTORY may be 
augmented by the possibility of having one or more of the
chain of agents wander off (after having been lined up) in a
manner consistent with the agent (water might evaporate, fire
burn out, lion be distracted by game, etc.). This would add
another dimension to the story.


On a deeper level, one may wonder whether it is possible for 
the computer to play a greater role in the formation of the 
plot and deciding on the 'point' of the story. Would computer-
generated stories always remain in the entertainment category
or could they serve some useful function such as describing 
some complex event within, say, an operating system? The ques- 
tion of randomly generated stories is currently a topic of 
considerable interest. See AI FORUM [1974] for a vigorous 
discussion and several other references. Also Knuth [Vol. 2] 
describes a random western which was used as the basis for a 
television film.


Exercise 16.1

RANDOM(0) has a distribution which is uniform over the
interval (0,1). It is sometimes required to have other kinds
of distributions. Define the distribution function (sometimes
called the cumulative distribution function) D(X) of a random
number generator R() as the function


     D(X) = Prob{ R() < X }


For example, the distribution function associated with the
uniform distribution slopes between 0 and 1 in the range (0,1)
and is 0 below and 1 above this range. Given an arbitrary
distribution function D(), write the random generator R() in
terms of the uniform generator RANDOM() and the inverse of
D(), call it ID(), which is presumed to exist.


Exercise 16.2

Suppose that a program requires random numbers between 0 and 1
in such a way that x is x/y times as likely to occur as y.
Thus 1/2 is twice as likely to occur as 1/4. Write the
distribution function D() for the generator. Write a program
to produce the random numbers (functions in the ARITHMETIC
chapter can be used).


Exercise 16.3

Let a deck of cards be represented by 52 separate characters,
say:


     DECK = 'ab ... zAB ... Z'


In one statement, deal out four 5-card poker hands to players
P1, P2, P3 and P4. (Any function(s) in this chapter may be
used.)


Exercise 16.4

A well-known game is to find, for a given telephone number, a
sequence of letters which (1) when dialed will produce the
same number and (2) are a pronounceable sequence. For example,
233-6874 can perhaps more easily be remembered as 'BEDMUSH' or
'ADDNURI'. The correspondence is:


     2  ABC        6  MNO
     3  DEF        7  PRS
     4  GHI        8  TUV
     5  JKL        9  WXY


(1's and 0's create problems).
Write a function RPHONE to accept a telephone number and
return a random sequence of letters associated in the above
sense with the number. The sequence should bear some
similarity to English; to do this, use RCHAR for probable next
characters.


Exercise 16.5

What single statement can be modified so that RSELECT (Prog.
16.7) saves space rather than time?


Exercise 16.6

Augment the assignment interpreter in RSENTENCE so that the
variable assigned into need not also be the name of the
syntactic variable expanded. One way to do this is to let


     =var/s=

be interpreted as:

     var = RSENTENCE(s)


Exercise 16.7

If the argument to RSENTENCE is not well formed, the function
can loop. Give an example of a string which will have this
effect. What modification to RSENTENCE can correct this?
(Requires the addition of six characters and a blank.)


Exercise 16.8

This exercise is based on a suggestion by Yngve [1962]. In the
input to RSENTENCE let /text/ indicate that the result of
evaluating text (via RSENTENCE(text)) is to be placed in the
stack after the next item. An item is defined as either a
syntactic unit or a sequence of non-blanks. Thus


     <SENT>::= <NOUN> <VERB-PHRASE> <NOUN>
     <VERB-PHRASE>::=<VERB>/ <ADVERB>/


can result in " He called her up". Incorporate Yngve's
suggestion into RSENTENCE.


Exercise 16.9

In RSENTENCE, there are several characters which can't be used
directly within alternatives because they have some
meta-meaning (such as <>| etc.). Define an 'escape' convention
so that any special character can be incorporated in the final
text. Implement your scheme (hint: this can be implemented by
modifying one pattern).


Exercise 16.10

For which of the following definitions will <S> have a
probability of looping greater than 0?


     (a) <S>::=A|<S>A|<S><S>A
     (b) <S>::=#2#A|<S>A|<S><S>A|<S><S><S>A
     (c) <S>::=A|<T><T>
         <T>::=B|<S>C
Exercise 16.11

What is the probability that


     <S>::=A|<S>A<S>B<S>

as input to RSENTENCE will halt?
Exercise 16.12

The 'one-arm bandits' of gambling fame (also known as slot
machines) have three windows in which one of 20 pictures can
appear as follows [Spencer 1968]:


     Symbol          | Wheel 1 | Wheel 2 | Wheel 3
     ----------------+---------+---------+--------
     Cherry (C)      |    4    |    6    |    0
     Orange (O)      |    5    |    4    |    7
     Bell (E)        |    4    |    6    |    5
     Lemon (L)       |    3    |    2    |    4
     Watermelon (W)  |    3    |    1    |    3
     Bar (B)         |    1    |    1    |    1

Payoffs are as follows:

     C - -     3          W W B    15
     C C -     5          O O O    18
     O O B     6          W W W    20
     E E O     8          B B B   200
     L L L    10


Identify the sample space. Determine the total input to the
machine and the total return if each item in the sample space
is hit once and only once. What percentage of total bets is
taken by the machine? Write a program to simulate the slot
machine (can be done in as few as 10 statements using SUBSTR
(Prog. 3.9) and RANDOM).


Exercise 16.13

(a) Write a program to compute the area under the curve
Y = X**2 on the interval [0,1) by Monte Carlo techniques.
Print out this area every 100 samples so that you can observe
the rate at which the answer converges to its correct value
(1/3). (Hint: this requires a total of three statements.)
(b) Compute the 95% confidence interval after N trials and
compare this figure with the experimental results.


Page 372          Chapter 16 - STOCHASTIC STRINGS


Exercise 16.14

To speed up the previous exercise, DUPL and CODE can be used
so that the inner loop of three statements is reduced
effectively to one. How can this be done?


Exercise 16.15

Modify RSEASON (Prog. 16.10) so that with probability E a
batsman will advance to first by means of an error where
otherwise he would simply have made an out. All other runners
should advance one base.


Exercise 16.16

Write a program called RGAME which will behave like RSEASON
except that RSENTENCE is used to supply running commentary of
the events which transpire. Include names of players in the
input data. Make your game colorful. Don't have a player
merely make an out; have him hit a sharp drive to center which
is speared by the centerfielder.


Exercise 16.17

Sagasti and Page [1970] describe an effort to program and
actually stage a computer-generated dance routine. The stage
is divided up into 13 areas roughly as shown in Figure 16.1.


Figure 16.1


The decomposition of the stage to produce a random
dance.


A dancer is permitted to move from one circle to an adjacent
one; for example, in Figure 16.1 a dancer at F can move to any
of A, B, E, G, J, or K; of course, the dancer may also remain
at the same position. Dancers may exit and enter at random
times but only to or from what may be called terminal nodes.
For the exercise, let E, J, K, L, M and I be the terminals.
Also, no two dancers may occupy the same spot at the same
time.


Implement a program to produce a random dance with the ad- 
ditional constraint that there be left-right symmetry. That 
is, for example, if a dancer moves from A to B then another 
dancer must move from D to C. To allow movement into the cen- 
ter position, create a new position Y which is offstage cen- 
ter. If a dancer at K goes to G then the dancer at L must go 
to Y, etc. Also, permit dancers at G and Y to change places. 
Denote offstage left as position X and offstage right as posi- 
tion Z. The output of the program should be a list of instruc- 
tions for each of eight dancers. 


Be careful! Sagasti and Page describe their initial efforts as
resulting in "pandemonium on stage" until a slower tempo was
found. They also described one dancer as "mildly bitter" at
being forced to leave early.


Exercise 16.18

Change the story given by RSTORY to one involving a space
motif. Use RWORD to provide strange-sounding names of people
and planets.


Games are artificial environments, frequently abstracted
from reality, intended to amuse and/or exercise the
cranium. The computer (and computer programmers) are
quite proficient at simulating such abstractions, much
more so than the reality backdrop, so that there has
for a long time been a happy marriage between computers and
game playing (frequently to the chagrin of management intent
on putting the high-priced piece of equipment to better use
than amusing its high-priced employees). As the cost of
computation diminishes, however, the recreational or
game-playing applications of digital computers may be expected
to increase, and surely any survey of SNOBOL4 applications
would not be complete were it to ignore this area entirely.
The computer is, after all, the ultimate game if not the
ultimate player.


We almost, but not quite, include under the heading of games, 
attempts to make the computer behave (i.e. converse) like a 
human. Weizenbaum [1966] made a notable attempt in this
direction with his program ELIZA. ELIZA will converse with the
user in a form characteristic of a script given to it as data.
The most familiar and popular script makes ELIZA behave like a
psychiatrist. Though ELIZA was originally written in Fortran,
Duquet [1970] has written a 'dramatically shorter' version in
SNOBOL4. In SNOBOL4, the program is actually smaller than the 
psychiatrist script (two pages versus four). While we do not 
include the program here, we note in passing that dialogue is 
a necessary aspect of most games and a snappy dialogue can add 
an appeal to an otherwise not-too-exciting game. We will 
return to this issue later. 


For good or ill, many games have been programmed on the com- 
puter. At a nearby PDP-10 time-sharing computer there exist 
twenty-some games including Chess, Go, Black Jack, Go-Moku, 
Monopoly, Tick-tack-toe (two and three dimensions), Nim and 
games based on football, golf and Startrek to mention only 
those names that are immediately recognizable. There are many 
other games which have been, or will be, written for a digital 
computer; see Spencer [1968], Ball [1962] and especially Ahl
[1973].


A game may be concealed or open. In an open game, such as 
Chess or Checkers, all information concerning the state of the 
game is available to both players. In concealed games, such 
as in many card games or in penny matching, each player may
have information unavailable to the other. This is clearly 
the case if one is holding cards unseen by one's opponent. 
With penny matching, the concealed information is the player's 
strategy. In a concealed game, the player must play in such a 
way as not to reveal his hidden information and therefore the 
techniques and analysis are quite different from the open 
game. 


In concealed games, there seems to be a problem involving
player and computer credibility which does not exist with the
open game. Consider the game of penny-matching in which both
players choose a side of a penny; one player wins (the other
player's penny) if there is a match; otherwise the other
player wins. With a computer there is a problem. If the
computer goes first, there is the possibility that the player
will cheat. If the player goes first, he may suspect the
machine of cheating. Hagelbarger [1956] built a penny-matching
machine, called SEER, which 'solved' this problem by the human
saying aloud his choice of head or tail and the machine
(sensitive only to sound) would indicate its choice whereupon
the player would tell the machine, by a push button, who won.
The machine can't cheat under these circumstances but the 
human certainly can. A counter was wired up to accumulate 
total wins and losses for the machine. Though the machine won 
most of its games, the results are clouded by the fact that 
some players would deliberately lie to the machine to see how 
it would operate in stressful situations. 


One solution to the concealment problem lies in the use of a
one-way cipher (see ONEWAY, Prog. 16.4). Recall that given
the returned value of ONEWAY(S) it is impractical to compute
the original S or, indeed, any S which would yield the same
returned value. Hence the computer can choose a random string
R (possibly based on the clock) and then call ONEWAY(R 'H') if
it chooses a head or call ONEWAY(R 'T') if it chooses a tail.
The computer prints the returned value. Then the player plays. 
The machine then reveals its move together with R. The player 
can check, if he cares to, whether the previously printed 
value corresponds to the given value of R. Spot-checking a 
machine for fraudulent behavior should, in this way, be fairly 
easy. 
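The protocol just described can be sketched as follows. This Python fragment is only an illustration: the function oneway uses SHA-256 as a stand-in (an assumption, not the cipher of Program 16.4), and the names commit and verify are hypothetical.

```python
import hashlib
import secrets

def oneway(s):
    """Stand-in for ONEWAY; SHA-256 is an assumption, not the book's cipher."""
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

def commit(choice):
    """The machine fixes its choice and returns (value to print, secret R)."""
    r = secrets.token_hex(16)          # random string R
    return oneway(r + choice), r

def verify(printed, r, choice):
    """The player's spot check once R and the move have been revealed."""
    return oneway(r + choice) == printed

choice = secrets.choice("HT")          # machine's head or tail
printed, r = commit(choice)
print("printed before the player moves:", printed)
# ... the player announces a call here ...
print("machine reveals", choice, "with R =", r)
print("check passes:", verify(printed, r, choice))
```

The random string R matters: without it the player could simply compute the one-way value of 'H' and of 'T' in advance and read the machine's choice from the printed value.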


A one-way cipher can also be used to make sure that a computer 
is giving you a fair deal. See Exercise 17.1. 


Decision Trees and Decision Graphs 


A decision tree exists, at least conceptually, for any 
discrete open game. The top node, or root of the tree,
represents the decision node of the first player and has a 
branch descending down for each possible choice of the first 
player on his first move. Each such branch descends to a node 
representing the decision node of the second player, etc. An 
actual decision tree is produced for a simple version of the 
stone game (see Figure 17.1). 


Decision trees grow exponentially and hence tend to be large. 
A complete decision tree for the game of Tick-tack-toe is for- 
bidding enough. One for the game of Chess is so large as to 
be meaningless. For example, at 10 moves per play and for 70
plays, the number of nodes in the tree exceeds the number of 
atoms in the earth. 


It is more convenient to think of an open game as a collection
of states where each move carries the play to a different
state. There are terminal states which end the game and
indicate a winner for one of the players. If every different
move sequence leads to a different state, then the decision
tree is equivalent to the decision graph. But in many games,
the number of different states is far smaller than the number
of nodes in the decision tree and the problem becomes tractable
with a graph even though it appears to be impossible with a
tree.


One of the appeals of the decision tree is that it leads 
conceptually to a solution by means of the minimax process. 
The first player (A) selects that node which will maximize the 
outcome for him assuming that the second player will respond 
with the move that will minimize the output for A assuming 
that the first player responds with the move ... , etc. This 
strategy may be carried over to the decision graph as follows. 
Label all terminal states as +1 if a victory for the first 
player and -1 if a loss and 0 if a tie. Find a state that is 
directed only to terminal states. If it is a move by A, mark 
it with the maximum of the values of all states reachable from
it. If it is a move by player B, mark it with the least such
value. Each state will be thus marked with the value of the
state to player A (assuming both players play optimally). If
there is no state which is directed only to states already
marked, then the game is not well-formed as it contains loops 
(or, what is equivalent, infinite paths). 
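The marking procedure can be sketched in a few lines. The Python fragment below is an illustration only: the function names are hypothetical, and the game used (a pile of stones from which each player removes one or two, the player taking the last stone winning) is an assumed example, not necessarily the stone game of Figure 17.1.

```python
def label_states(successors, terminal_value, mover):
    """Mark every state with its value to player A by backward marking.

    successors:     state -> list of states reachable in one move
    terminal_value: terminal state -> +1, 0 or -1 (value to A)
    mover:          non-terminal state -> 'A' or 'B'
    """
    value = dict(terminal_value)
    unmarked = {s for s in successors if s not in value}
    while unmarked:
        # States all of whose successors are already marked.
        ready = [s for s in unmarked
                 if all(t in value for t in successors[s])]
        if not ready:
            raise ValueError("game is not well-formed: it contains loops")
        for s in ready:
            options = [value[t] for t in successors[s]]
            value[s] = max(options) if mover[s] == "A" else min(options)
            unmarked.discard(s)
    return value

def stone_game(n):
    """Assumed example game: take 1 or 2 stones; taking the last stone wins."""
    successors, terminal_value, mover = {}, {}, {}
    for stones in range(n + 1):
        for player in "AB":
            state = (stones, player)
            if stones == 0:
                successors[state] = []
                # The player to move faces an empty pile: the other player won.
                terminal_value[state] = -1 if player == "A" else +1
            else:
                other = "B" if player == "A" else "A"
                successors[state] = [(stones - k, other)
                                     for k in (1, 2) if k <= stones]
                mover[state] = player
    return successors, terminal_value, mover

value = label_states(*stone_game(4))
print("4 stones, A to move:", value[(4, "A")])
print("3 stones, A to move:", value[(3, "A")])
```

With four stones the graph has only ten states, whereas the corresponding decision tree repeats the same positions along every path that reaches them.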


It will clearly be impossible to present a large number of in- 
tricate game-playing programs in this section. One complete 
chess program could perhaps occupy the better part of this 
book. What we can do is present a few games illustrative of 
their type and also give some commonly useful functions. 


Program 17.1 - PHRASE

For many computer-game players it is necessary to provide a
carrot and a stick; otherwise, they will simply lose interest
and quit. For the carrot we will issue a random compliment
and, for the stick, we will generate an insult. These are
illustrated by the two functions PRAISE() and INSULT(). There
is also a function to mark time called LETMESEE(). Using
RSENTENCE (Prog. 16.8) the dialogue is always fresh and
lively.


         DEXP("PRAISE() = RSENTENCE('<PRAISE>')")
         DEXP("INSULT() = RSENTENCE('<INSULT>')")
         DEXP("LETMESEE() = RSENTENCE('<LETMESEE>')")


Names referenced     Name        Type       Where defined
by PHRASE:           DEXP        Function   Program 14.1
                     RSENTENCE   Function   Program 16.8


The input for RSENTENCE is:

<GOOD>::=excellent|wonderful|nice|careful|impeccable|shrewd|
clever|nifty|good|smart|skillful|cunning|witty|fine|
splendid|elegant|#5#very <GOOD>|bright|brainy|brilliant|sharp|
keen|nimble-witted|slick|sly|astute|penetrating
<LETMESEE>::=<THOUGHT>|<MUMBLE>|<MUMBLE> <THOUGHT>|<THOUGHT>
<MUMBLE>
<MUMBLE>::=Hmmm|Ahh|Well Well|Gosh|Gee|OK|Oh man|Let's see|
Wait a minute|Interesting|Wow|Wowee|Yipes|Zowee|Whoosh|
#5#<MUMBLE> <MUMBLE>|#6#<MUMBLE>...
<THOUGHT>::=<LETME> <CONSIDER> <THIS>
<LETME>::=I think I'll|let me|I need time to|I'm going to
have to
<CONSIDER>::=consider|contemplate|mull over|#4#<THINK> about
<THINK>::=think|see|cogitate|meditate
<THIS>::=this|this one|the situation|this problem|this here
<P1>::=maneuver|strategem|tactic|play|move
<P2>::=performance|game|effort
<P3>::=play|strategy
<P13>::=<P1>s|<P3>
<P23>::=<P2>|<P3>
<P123>::=<P1>s|<P2>|<P3>
<PRAISE>::=<THANKS> for the game, <NICEGAME>
<THANKS>::=Thanks|Thank you|Thank you very much
<NICEGAME>::=I admired the <GOOD> <P123> on your part|
that was <GOOD> <P3> on your part|your <P1>s were quite
<GOOD>|it was a pleasure to play against one so <GOOD>|I
enjoyed your <GOOD> <P123>|I enjoyed particularly that last
<GOOD> <P1>
<STUPID>::=stupid|dumb|blundering|thick-headed|sad|
thick-skulled|silly|ludicrous|witless|poor|ponderous|
brainless|foolish|bungling|heavy-handed|graceless|clumsy
<FOOL>::=fool|dolt|idiot|oaf|blockhead|chump|ass|moron|ninny|
nincompoop|chump|dunce|bonehead|fathead|imbecile|jerk|baboon
<INSULT>::=You <STUPID> <FOOL>|I have never seen such <STUPID>
<P13>|Your <STUPID> <P23> befits a <STUPID> <FOOL>|
Your <STUPID> <P1>s indicate that you are a <STUPID>
<FOOL>|A <STUPID> <FOOL> is not so <STUPID> as you|
Your <P23> marks you as a <STUPID> <FOOL>|Your <P1>s are
less than <GOOD>
END


Epilogue


While random sentence generation has been around for quite 
some time, it generally comes in the form of a program which 
prints something. It is then neither obvious nor easy to har- 
ness the sentence generation for other than demonstrating the 
effect. It was for this reason that RSENTENCE was written as 
a function. 


Some sample phrases are:


"Thanks for the game, that was nice strategy on your part"
"you dumb idiot"
"Interesting Hmmm..."
"I'm going to have to consider this"
"I have never seen such thick-headed strategems"
"Thank you for the game, your plays were quite shrewd"


It should be obvious which phrases were respectively returned
by INSULT(), PRAISE() and LETMESEE().
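The benefit of packaging sentence generation as a function can be seen in a small sketch. The Python fragment below is not RSENTENCE: it handles only plain <NAME> expansion over an abridged, hypothetical copy of the grammar above, and it ignores the #n# prefixes, assignments and evaluated expressions that the real input allows.

```python
import random
import re

# Abridged, hypothetical subset of the PHRASE grammar.
GRAMMAR = {
    "PRAISE": ["<THANKS> for the game, <NICEGAME>"],
    "THANKS": ["Thanks", "Thank you", "Thank you very much"],
    "NICEGAME": ["that was <GOOD> <P3> on your part",
                 "your <P1>s were quite <GOOD>"],
    "GOOD": ["excellent", "nice", "shrewd", "clever"],
    "P1": ["maneuver", "tactic", "play", "move"],
    "P3": ["play", "strategy"],
}

VARIABLE = re.compile(r"<([A-Z0-9_]+)>")

def expand(text, grammar, rng):
    """Replace the leftmost <NAME> by a random alternative until none remain."""
    while True:
        match = VARIABLE.search(text)
        if match is None:
            return text
        replacement = rng.choice(grammar[match.group(1)])
        text = text[:match.start()] + replacement + text[match.end():]

def praise(rng=random):
    """Return a random compliment as a string, ready for use by any caller."""
    return expand("<PRAISE>", GRAMMAR, rng)

print(praise(random.Random(1)))
```

Because praise() returns a string instead of printing it, a game routine can embed the phrase in a longer message, log it, or discard it.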


Program 17.2 - QUEST

QUEST is intended to save some of the routine problems and
house-keeping chores associated with a dialogue system. For
example, all game routines will request numbers and/or strings
from the player. The system must then check if these arguments
are valid and, if not, indicate what is expected. If valid,
the argument must be interpreted or assigned to a variable and
an appropriate branch must be taken. Certainly, none of these
chores is difficult to do, but it will be more convenient to
combine them into one routine. For example,


         QUEST('How much do you wish to bet?/BET(1...10)|(DROP)DR')
+                                                    :S($LABEL)


will print the message: 
How much do you wish to bet? 


(i.e. all characters up to the slash) and then either accept 
an integer in the range 1...10 and assign it to BET or accept 
the literal input DROP and transfer to label DR. The transfer 
is accomplished by having QUEST assign the string 'DR' to the 
global variable LABEL; if such an assignment is made, the 
RETURN exit is taken; otherwise the FRETURN exit is taken. In 
this way, the actual transfer takes place outside the function 
as shown. 


In general, the string following the slash is called the QUEST 
pattern and is a sequence of descriptors separated by bars. 
Each descriptor is of the form: 


variable (values) label 


The variable, if any, is assigned the value (if accepted) and 
the label is assigned as described above. Values may be of 
the form: 


     number...number


or some string constant, or the string ARB implying that any
string of characters will be accepted.


If the user types something that doesn't match, an error
message (including a random insult) is given. Using the above
example, the message (among other things) that will be typed
is:

     The correct form is: 1...10 | DROP


In general, the message will contain the QUEST pattern with 
labels, variables and parentheses stripped off. 


As a final bonus, if the user ever types question mark (?), a 
friendly reminder of the correct form is given. 
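The descriptor syntax can be made concrete with a short sketch. The Python fragment below is an approximation of the behavior described, not a transcription of QUEST: the names questp and correct_form and the regular expression are assumptions, and range bounds must be literal integers here, whereas the SNOBOL4 version also evaluates expressions.

```python
import re

# One descriptor: optional variable, "(values)", optional label.
DESCRIPTOR = re.compile(r"^([A-Za-z_][A-Za-z0-9_.]*)?\(([^()]*)\)(.*)$")

def questp(answer, quest_pattern):
    """Return (variable, value, label) for the first descriptor that
    accepts `answer`, or None if no descriptor matches."""
    for descriptor in quest_pattern.split("|"):
        m = DESCRIPTOR.match(descriptor.strip())
        if m is None:
            continue
        variable, values = m.group(1), m.group(2)
        label = m.group(3).strip() or None
        if values == "ARB":                      # anything is accepted
            return variable, answer, label
        if "..." in values:                      # integer range low...high
            low, high = values.split("...", 1)
            try:
                number = int(answer)
            except ValueError:
                continue
            if int(low) <= number <= int(high):
                return variable, number, label
        elif answer == values:                   # literal string constant
            return variable, answer, label
    return None

def correct_form(quest_pattern):
    """The QUEST pattern with variables, labels and parentheses removed."""
    parts = []
    for descriptor in quest_pattern.split("|"):
        m = DESCRIPTOR.match(descriptor.strip())
        parts.append(m.group(2) if m else descriptor.strip())
    return " | ".join(parts)

PATTERN = "BET(1...10)|(DROP)DR"
print(questp("7", PATTERN))       # value to assign to BET
print(questp("DROP", PATTERN))    # label DR to branch to
print("The correct form is:", correct_form(PATTERN))
```

A caller would assign the value to the named variable when one is present and branch to the label when one is returned; a result of None corresponds to the error message with its reminder of the correct form.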


            DEFINE('QUEST(QS)QP,QPA,QN,QVP,QL,QLOW,QHI,QI')
*
* First define a utility function QUESTP(QS,QP) which will
* analyze the argument string QS according to the QUEST
* pattern given by QP.  It will fail if no match is found.
*
            DEFINE('QUESTP(QS,QP)QP1,QS1')           :(QUESTP_END)
*
* Entry point: Break on an alternative and if one is found
* call QUESTP recursively.
*
QUESTP      QP BREAK('|') . QP1 '|' =                :F(QUESTP_1)
            QUESTP(QS, QP1)                          :S(RETURN)F(QUESTP)


Ce ee te eRe ET th Tee ee el a eT Te ee ee 
{ In QP we now have a single QUEST descriptor. Obtain the | 
{ variable name (QN), the label name (QL) and the value pat- | 
{ tern (QVP). { 
a a es 
QUESTP_1 QP BREFAK('(") . QN '(" = :F (FRETURN) 

QN = IDENT(QN) ‘'QpUMMY' 

QP BREAK(')') . QVP ')* REM. QL 


a a a AIRE A BT RRS SL ERED, | 
{ If QS matches the value pattern, branch to QUESTP_3 for |{ 


{ the assignment. Convert QS if necessary to the proper | 
{ type. { 
Ee ar ee SR EE aE | 

IDENT (QVP, 'ARB‘) :S (QUESTP_ 3) 

OVP ARB . QLOW '...' REM . QHI :S (QUESTP_ 2) 

IDENT (QS, OVP) :S (QUESTP_3) F (FRETURN) 
QUESTP_2 QLOW = -~AINTEGER (QLOW) EVAL (QLOW) 

QHI = -INTEGER (QHI) EVAL (QHT) 

QS = CONVERT (QS, 'INTEGER*)} 3 F (FRETURN) 

(LE (QLOW,QS) LE(QS,QHTI)) :F (FRETURN) 

QUESTP_3 $ON = QS 

LABEL = DIFFER (QL) OL : (RETURN) 
QUESTP_END ; 


Ce ee 
| Define a pattern (QUEST.QPA) which will extract from a | 
| QUEST descriptor, the inner QUEST pattern. ID.V will match | 
{ an identifier assigning it to V. 1 
(ES En a et en ET a i IC I NO A | 


NEUT = BREAK('{()‘) 

QUEST.OPA = NEUT '(' NEUT . QPA ')* (NEUT | REM) 
A = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' 

ID.V = (ANY(A) (SPAN(A '0123456789_.') | 'f)) .V 


: (QUEST_END) 


a eae en ee a ee ee eee ee eens ee 
{ Entry point: After printing the message, interpret the |{ 
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{ input. Errors are processed at QUEST_1. | 
| Fern | 


QUEST LABEL = 
Qs BREAK('/*) . OUTPUT ‘'/' REM . QP 
QI = TRIM(INPUT) $ OUTPUT = QT 
QUEST_1 QP <ID.V '...' = EVAL(V) '...! :S(QUEST_ 1) 
QUEST_2 QP '...' IDV = '...' EVAL (V) :S (QUEST 2) 
(DIFFER(QI,'?') QUESTP(QI,QP)) :F (QUEST_3) 
CIFFER (LAREL) :S (RETURN) F (FRETURN) 
aie a aa a a a a La Ee a a ea | 
| Extract and print the pattern and also indicate our | 
{ feelings. { 
{SERS A ee ee | 
QUEST_3 QP QUEST.QPA = QPA :S (QUEST_3) 
OUTPUT = DIFFER(QI,!?*) 
+ RSENTENCE('Bad input, you <STUPID> <FOOL>') 
OUTPUT = ‘'The correct form is ' QP : (QUEST) 
QUEST_END 
Names_referenced Name Type Where defined 
by QUEST: STUPID Syntactic Variable Program 17.1 
FOOL Syntactic Variable Program 17.1 

Program 17.3  STONE

Let there be N stones in a pile (where N is odd) and let each
player take, on each move, either 1, 2, ..., or K stones from
the pile. When the pile is exhausted, the player with an odd
number of stones wins. For example, if N=5 and K=2 we have a
very simple game for which we can portray a complete decision
tree as shown in Figure 17.1.


By applying the previously described minimax procedure (or by
using common sense) the tree indicates a victory for the first
player, A. If the rules of the game are changed to make the
winner the one with even parity, the game is a victory for B,
no matter what A does on the first move.


The decision tree algorithm can be employed if the tree is
sufficiently small but becomes quite impractical as soon as
the game becomes nontrivial. To see this, let us fix K=2 and
let N vary. The number of branches, E(N), in the tree is given
by the formula:
E(N) = 2 + E(N - 1) + E(N - 2) 

which is immediately evident from the figure. While it may be 
an interesting exercise to solve this recurrence relation our 
purpose is served by simply noting that: 

E(N) > 2 * E(N - 2) 
so that 


E(N) > 2 ** (N/2) 
Figure 17.1


The decision tree for the stone game with N=5 and
K=2. Player A goes first. At each node, three
numbers indicate the number of stones left in the
pot, the number of stones in A's possession and
the number of stones in B's possession. Parens
indicate a decision node for A, brackets indicate
a decision node for B.


which implies that E(N) is exponential. 
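
The growth of E(N) is easy to tabulate. The following Python sketch iterates the recurrence; the base cases E(0) = 0 and E(1) = 1 are assumptions (an empty pile has no branches and a single stone allows exactly one move), since the text gives only the recurrence itself.

```python
def branches(n):
    """Number of branches E(n) in the decision tree for K = 2.
    Base cases E(0) = 0 and E(1) = 1 are assumed, not taken from the text."""
    if n <= 0:
        return 0
    if n == 1:
        return 1
    e_prev2, e_prev1 = 0, 1
    for _ in range(2, n + 1):
        # E(N) = 2 + E(N - 1) + E(N - 2)
        e_prev2, e_prev1 = e_prev1, 2 + e_prev1 + e_prev2
    return e_prev1

for n in (5, 10, 20, 40):
    print(n, branches(n), 2 ** (n / 2))
```

Under these base cases the bound E(N) > 2 ** (N/2) holds from N = 2 onward.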


The decision graph on the other hand is quite well-behaved 
especially if we combine all nodes with the same parities for 
the two players. That is, for a given number of stones in the 
pot, we can group all nodes together such that the player 
about to pick has an even parity. In this way the number of 
nodes is only 2N and the number of branches is bounded by 2NK. 
Figure 17.2 indicates (within the limits of our artistry) the 
decision graph for the stone game (with K=2 and N=5). 


From the decision graph it is an easy matter for a program to 
compute an optimal strategy for a game of any N and any K and 
for either victory parity. A 2 X N decision array is allocated 
which corresponds to the nodes of Figure 17.2. The rest is a 
simple matter of using the QUEST routine. 
Figure 17.2


A decision graph for the stone game with K=2 and
N=5. The nodes on the left are associated with
odd parity and those on the right with even
parity. Parity refers to the parity of the player
about to move.
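
As an illustration of the decision array, here is a Python sketch of the same bottom-up computation. It follows the SDA listing below but is not a transcription of it: the function name is hypothetical, and states in which the mover would hold more stones than have been taken are simply skipped.

```python
def stone_decision_array(nstones, parity, max_take):
    """sda[n][p] is the number of stones to take when n stones remain and
    the mover holds a count of parity p; 'L' means every move loses.
    `parity` (0 or 1) is the parity that wins when the pile is empty."""
    sda = [['L', 'L'] for _ in range(nstones + 1)]
    sda[0][parity] = 'W'
    for i in range(1, nstones + 1):
        for p in (0, 1):
            if nstones - i - p < 0:
                continue                      # unreachable state
            # Parity of the opponent's holdings at this node.
            opar = (nstones - i - p) % 2
            for j in range(1, min(max_take, i) + 1):
                if sda[i - j][opar] == 'L':   # leaves the opponent lost
                    sda[i][p] = j
                    break
    return sda

print(stone_decision_array(5, 1, 2)[5][0])   # first player's opening move
```

For N=5 and K=2 this agrees with the text: with odd parity winning the first player has a winning move, and with even parity winning the first player has none.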


*
*  The function SDA(NSTONES,PARITY,MAX) will create a Decision
*  Array for the Stone game for a given number of stones
*  (NSTONES).  PARITY (0 or 1) indicates which parity wins
*  and MAX indicates the maximum number of stones that may be
*  taken per step.
*
          DEFINE('SDA(NSTONES,PARITY,MAX)A,I,OPAR,P,J')
                                                  :(SDA_END)
*
*  Allocate and initialize the array (SDA).  SDA<N,P> indicates
*  what to do if there are N stones left and you've got parity
*  P.  If there is no right decision, an 'L' for lose is given.
*
SDA       SDA = ARRAY('0:' NSTONES ',0:1','L')
          SDA<0,PARITY> = 'W'
*
*  For each stone (I) and for each parity (P), determine the
*  strategy by finding which move (J) will end in a losing
*  situation for the opponent.
*
SDA_1     I = I + 1 LT(I,NSTONES)                 :F(RETURN)
          P = -1
SDA_2     P = P + 1 LT(P,1)                       :F(SDA_1)
          OPAR = REMDR(NSTONES - I - P,2)
          J = 0
SDA_3     J = J + 1 LT(J,MAX)                     :F(SDA_2)
          IDENT(SDA<I - J,OPAR>,'L')              :F(SDA_3)
          SDA<I,P> = J                            :(SDA_2)
SDA_END
*
*  Main routine: The rules of the game follow the END label
*  and are optionally printed (no sense boring the expert, he
*  may be you).  The rest of the program should be self-
*  evident and will be given without further comment.
*
          QUEST('Do you want the rules?/(NO)NEWG|(YES)')
+                                                 :S($LABEL)
STONE_1   OUTPUT = INPUT                          :S(STONE_1)
NEWG      QUEST('No. of stones (odd) = /NSTONES(1...1000)')
          EQ(REMDR(NSTONES,2),0)                  :S(NEWG)
          QUEST("Winner's Parity (0...1) = /P(0...1)")
          QUEST('Maximum Take = /MAX(2...1000)')
OLDG      NS = NSTONES
          MAXA = MAX
          A = SDA(NS,P,MAX)
          HIM = 0
          ME = 0
HIS_TURN
          OUTPUT = 'There are ' NS ' stones in the pile.'
          MAXA = GT(MAXA,NS) NS
          QUEST('How many do you want? /K(1...MAXA)')
          NS = NS - K ; HIM = HIM + K
          EQ(NS,0)                                :S(TOTALIZE)
MY_TURN
          K = A<NS,REMDR(ME,2)>
          K = IDENT(K,'L') 1
          NS = NS - K
          ME = ME + K
          OUTPUT = LETMESEE()
          S = K ' stones.'
          S = EQ(K,1) 'just one.'
          OUTPUT = "I think I'll take " S
          EQ(NS,0)                                :F(HIS_TURN)
TOTALIZE
          OUTPUT = 'You have ' HIM ' stones and I have ' ME ' stones'
          EQ(REMDR(HIM,2),P)                      :S(HE_WINS)
          OUTPUT = 'That means I win'
          OUTPUT = INSULT()                       :(CHANGE)
HE_WINS
          OUTPUT = 'That means you win'
          OUTPUT = PRAISE()
CHANGE
          QUEST('Would you like to change the game? /'
+            '(YES)NEWG|(NO)OLDG')                :($LABEL)
END

Names referenced   Name     Type       Where defined
by STONE:          QUEST    Function   Program 17.2
                   PHRASE   Package    Program 17.1
Epilogue


It is necessary to be as complete as possible in the processing
of input information when the user of the system is someone
other than the person who wrote the program. This is especially
true here where presumably the user is the playful sort anyway.
This was the reason for the creation of the variable MAXA whose
purpose is to limit the value of the selection to the smaller
of the stated limit and the number of stones left in the pile.


An example of a typical session with the STONE game is shown 
below. Underlined sections indicate the machine's responses. 


Do you want the rules? NO
No. of stones (odd) = 13
Winner's Parity (0...1) = 0
Maximum Take = 3

How many do you want? 3
Let me contemplate this one


Program 17.4  TICTACTOE

The reader is presumed familiar with the game of Tick-tack-toe
whose popularity is itself a puzzle since it is hard to do
anything but tie. Nonetheless, it is interesting enough to
illustrate several game-playing techniques.


A complete decision tree for the game has nine possible
choices for the first move, eight for the second, seven for
the third, etc. Hence there are 9! (= 362,880) branches in
the decision tree. Using SNOBOL4 and spending 10 milliseconds
on each branch, one must spend about 60 minutes of machine
time to analyze the game, which is a bit much. When one
considers the decision graph, however, there are only
3 ** 9 (= 19,683) possible boards and not every board is
reachable by the rules of the game. Thus, there is a great
deal of folding back.
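
The arithmetic behind these figures can be checked directly. This is a Python sketch; the 10 millisecond cost per branch is the figure quoted in the text, not a measurement.

```python
from math import factorial

branches = factorial(9)          # orderings of the nine squares
boards = 3 ** 9                  # each square is blank, X or O
seconds = branches * 0.010       # 10 milliseconds per branch
minutes = seconds / 60

print(branches, boards, round(minutes, 1))
```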


The pure tree-searching algorithm is actually quite simple
since one need only know how to make a move and how to detect
victory. That is, assume we write a routine, TTTV, to determine
the value of a board to, say, Player X (i.e. the one who
marks X's in squares as opposed to O's) and another routine
TTTM, which determines an optimal move for X. An arbitrary
board is given to TTTV which first tests whether a winning
combination exists. If so, the value of the board is self-
evident. If not, it asks TTTM for the best move for player X.
Upon getting it, TTTV evaluates the board from the point of
view of player O. It does this by interchanging O's and X's
and calling itself recursively. It then returns the negative
of the number so returned. The coding of TTTM is even simpler.
TTTM simply tries each move and asks TTTV to evaluate it (from
the standpoint of player O). This is not super efficient but
it works.


An algorithm based on the decision graph, on the other hand,
may at first sight appear to be much more complicated,
requiring a complete graph description of the game. But we can
let the computer do most of our graph-building as follows.
Record each new state (new board position) that we come to in
a table allocated for that purpose, and record with the table
the move made. At each new situation, the table is consulted
to see whether we've been there before.
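
The two ideas, evaluating a board by swapping marks and negating, and remembering boards already seen, can be sketched together in a few lines of Python. This is an illustration under assumptions and not the TTTV/TTTM code given later: it folds the move search into one function, uses an explicit list of winning lines instead of symmetries, and lets `lru_cache` play the role of the table.

```python
from functools import lru_cache

# The eight winning lines of a board stored row by row as 9 characters.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def wins(board, mark):
    return any(all(board[i] == mark for i in line) for line in LINES)

def swap(board):
    """Interchange X's and O's so the other player can be treated as X."""
    return board.translate(str.maketrans('XO', 'OX'))

@lru_cache(maxsize=None)          # the table of boards already evaluated
def value(board):
    """Value of `board` to X, who is about to move: 1, 0 or -1."""
    if wins(board, 'O'):
        return -1                 # the opponent has already won
    if ' ' not in board:
        return 0                  # full board: a tie
    best = -2
    for i, square in enumerate(board):
        if square == ' ':
            after = board[:i] + 'X' + board[i + 1:]
            # Evaluate from the opponent's side and negate.
            best = max(best, -value(swap(after)))
    return best

print(value(' ' * 9))             # value of the empty board
```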


While these techniques are suitable for Tick-tack-toe, the
search times become impractical for more complicated open
games such as Chess and Checkers. To a first approximation,
these games can be played with a truncated decision tree which
means that the tree is searched to a limited depth and only a
limited number of alternative moves at each level are
considered. Samuel [1963] describes a Checker-playing program
which also stores boards as in the decision graph algorithm.
This permits the program to learn as it continues to play.
Note that storing a particular state helps not only when
returning to that state but in resolving the value of all
states which can reach the remembered state. In the game of
Checkers the number of states that need be remembered can be
reduced by considering all symmetries of a given board
position. This is fully illustrated with the game of
Tick-tack-toe. Thus if the proper response to:


      O |   |                          O | X |
     ---+---+---                      ---+---+---
        | X |       is remembered        | X |
     ---+---+---       to be:         ---+---+---
        |   | O                          |   | O


then we should not have to recompute if


        |   | O
     ---+---+---
        | X |
     ---+---+---
      O |   |


is encountered.


Assume that boards are represented as strings, so for example 
the last board above is represented as: 


     '  O X O  '


We can permute such a string very efficiently using positional 
transformations. But how many symmetries are there? Figure 
17.3 below illustrates the eight symmetries of the two- 
dimensional Tick-tack-toe board. 


Figure 17.3


The eight symmetries of the Tick-tack-toe board.


A method for producing these symmetries is found by noting
that the upper four are 90° clockwise rotations of each other
as are the bottom four. The first of the bottom group is found
by flipping one of the top group completely over so that we
are looking at its underside. Thus, with two basic permuta-
tions we are able, with the help of a little counting, to
produce all eight.
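
The counting scheme can be sketched in Python as follows. The two digit strings are the positional transformations used by NEXTBD later in this program (a clockwise rotation and a left-to-right flip); the function names are hypothetical, and boards are assumed to be 9-character strings read row by row.

```python
ROTATE = '741852963'   # square 7 moves to position 1, 4 to 2, and so on
FLIP = '321654987'     # mirror image, left to right

def permute(board, order):
    """Rearrange a 9-character board according to a string of positions."""
    return ''.join(board[int(d) - 1] for d in order)

def symmetries(board):
    """Return the eight equivalent boards: rotate each time, flip instead
    on every fourth step.  The eighth board is the original itself."""
    result = []
    for n in range(1, 9):
        board = permute(board, FLIP if n % 4 == 0 else ROTATE)
        result.append(board)
    return result

def canonical(board):
    """One representative per class of equivalent boards."""
    return min(symmetries(board))

for b in symmetries('O    X   '):
    print(repr(b))
```

A table keyed by `canonical(board)` then needs only one entry for all eight orientations of a position.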


It is not always easy to determine the number of symmetries 
for some arbitrary board game. A method that may prove helpful 
is to consider the number of equivalent serializations of the 
points of the board. For example, we can serialize the points 
of Tick-tack-toe in the order indicated in the diagram below: 


      1 | 2 | 3
     ---+---+---
      4 | 5 | 6
     ---+---+---
      7 | 8 | 9


An equivalent serialization would require that we begin at 
some corner (there are 4) and that we proceed along some edge 
(given the corner, there are 2 possibilities) and sweep the 
square one line at a time until all points have been touched. 
There are therefore 8 in all. 


Whereas before we could count approximately 20,000 different
Tick-tack-toe boards, there are far fewer if we take into
account symmetries. Unfortunately, if we wanted to determine
exactly how many we could not simply divide 20,000 by 8 to
obtain 2,500 as this would not allow for the fact that some
boards rotate or flip into themselves. Though 2,500 is a good
lower bound, to find the exact number one must use Polya's
theory of counting. See for example Harrison [1965]. We will
be content with letting the program do the counting.
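
Letting a program do the counting is straightforward for a board this small. The Python sketch below enumerates all 3 ** 9 markings (legal or not, as in the estimate above) and keeps one representative per symmetry class. The helper names are hypothetical and the two position strings are those used by NEXTBD.

```python
from itertools import product

ROTATE = '741852963'   # clockwise rotation
FLIP = '321654987'     # left-to-right flip

def permute(board, order):
    return ''.join(board[int(d) - 1] for d in order)

def canonical(board):
    """Smallest of the eight boards equivalent to `board`."""
    best = board
    for n in range(1, 9):
        board = permute(board, FLIP if n % 4 == 0 else ROTATE)
        best = min(best, board)
    return best

classes = {canonical(''.join(squares))
           for squares in product(' XO', repeat=9)}
print(len(classes))
```

The count comes out somewhat above the lower bound of 2,500, as expected, because boards that map into themselves are not shared eight ways.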


In what follows we will define the functions TTTV and TTTM for 
the game of Tick-tack-toe. Given these functions, it should 
be an easy matter to write a complete program to play the game 
with a human opponent. Also, the program will play other games 
on the 3X3 board by simply changing the definition of losing 
pattern (LOS_PAT). It will play other O-X games on different 
size boards by changing the definition of equivalent board 
(the function NEXTBD) as well as LOS_PAT. These are left as 
exercises. 


TTTM remembers board positions by storing them in the table 
TTT. This table can be initialized with boards which block
opponent victory (increasing efficiency) or with boards in- 
dicating heuristic plays or standard openings. These options, 
too, are explored in the exercises. 


*
*  We first define a utility routine which cycles through all
*  the boards equivalent to a given Tic-tac-toe board.  It
*  expects as argument the last board returned.  NEXTBD can
*  always be initialized by setting NEXT_N to 0.
*
          DEFINE('NEXTBD(B)')                     :(NEXTBD_END)
*
*  Entry point: The first REPLACE is a clockwise rotation
*  (done each time).  The second REPLACE is a flip (done every
*  four times).
*
NEXTBD    NEXT_N = EQ(NEXT_N,8)                   :S(FRETURN)
          NEXT_N = NEXT_N + 1
          NEXTBD = REPLACE('741852963','123456789',B)
          NEXTBD = EQ(REMDR(NEXT_N,4))
+            REPLACE('321654987','123456789',B)
                                                  :(RETURN)
NEXTBD_END
*
*  TTTV(B) will determine the value of the board B to player
*  X given that it is his move.  It is presumed that he does
*  not yet have a winning combination.
*
          DEFINE('TTTV(BOARD)')
          LOS_PAT = POS(0) ('OOO' | 'O' LEN(3) 'O' LEN(3) 'O'
+            | LEN(3) 'OOO')
                                                  :(TTTV_END)
TTTV      NEXT_N = 0
          TTTV = -1
TTTV_1    BOARD = NEXTBD(BOARD)                   :F(TTTV_2)
          BOARD LOS_PAT                           :S(RETURN)F(TTTV_1)
TTTV_2    TTTV = 0
          TTTV = -TTTV(REPLACE(TTTM(BOARD),'XO','OX'))
+                                                 :(RETURN)
TTTV_END
*
*  TTTM will find the best move that player X can make on the
*  given board.  It first checks to determine whether it or
*  any board similar to it was processed before.  Old boards
*  are kept in the table TTT.  TTTM actually returns the new
*  game state.
*
          DEFINE('TTTM(BOARD)T,N,MAX,V')
          TTT = TABLE()
                                                  :(TTTM_END)
TTTM      NEXT_N = 0
          MAX = -2
          BOARD ' '                               :F(FRETURN)
TTTM_1    BOARD = NEXTBD(BOARD)                   :F(TTTM_2)
          TTTM = TTT<BOARD>
          DIFFER(TTTM)                            :S(RETURN)F(TTTM_1)
TTTM_2    BOARD (TAB(N) ARB) . T ' ' @N = T 'X'   :F(TTTM_4)
          V = -TTTV(REPLACE(BOARD,'OX','XO'))
          MAX = GT(V,MAX) V                       :F(TTTM_3)
          TTTM = BOARD
TTTM_3    BOARD POS(N - 1) LEN(1) = ' '           :(TTTM_2)
TTTM_4    TTT<BOARD> = TTTM                       :(RETURN)
TTTM_END

Game Theory

In concealed games, we have the added complexity that our
strategy may tip off our opponent to our disadvantage. In any
of the varieties of the game of poker, for example, aggressive
betting may scare off an opponent who might otherwise stick
and, in this way, fail to seduce him into betting more of his
funds in a losing cause. It therefore pays to vary one's
strategy and either not always bet aggressively with a good
hand or bet aggressively with a bad hand occasionally (the
so-called bluff). Many people feel that behavior such as
bluffing is incompatible with machine play. But as we will
see, machines can do very well in a game such as poker and in
fact can play truly optimal strategies.


Figure 17.4 


A two-person zero-sum game 


Let us take a hypothetical situation shown in Figure 17.4. 
There are two players, A and B, each with two possible moves, 
I and II. Each selects a move (unbeknownst to the other) and 
the matrix indicates how much B should pay A for each of the 
four possible outcomes. If the amount indicated is negative 
then the transfer of funds is in the direction from A to B. 
The game is called zero-sum because whatever one player wins 
the other loses; a situation which does not always exist in 
real life when, for example, a nuclear holocaust could be 
disastrous for both sides. 


How should A play the game? If he tries for the big payoff of
4 by always selecting move II, B will catch on eventually and
begin playing move I exclusively. Then A, seeing that he is
losing 2 on each turn will begin selecting move I until B
catches on to that. Clearly both sides must play a so-called
mixed strategy wherein their selection of I and II is
unpredictable. Neither player should base their move on a
strictly deterministic basis as this strategy may be uncovered
by the opponent and exploited. This conclusion is perhaps
intuitively implausible but one need only reflect on the
penny-matching game to see the importance of not developing
easily detectable patterns of play.
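
For a two-by-two game with no saddle point, the optimal mixture makes the opponent indifferent between his two moves, and this gives closed-form probabilities. The Python sketch below uses that standard result. The payoff matrix of Figure 17.4 is not reproduced here, so the second example matrix is hypothetical and only agrees with the two payoffs mentioned in the text (4 and -2).

```python
from fractions import Fraction

def solve_2x2(a, b, c, d):
    """Payoffs to player A are [[a, b], [c, d]]: rows are A's moves I and
    II, columns are B's moves I and II.  Assumes there is no saddle point.
    Returns (p, q, value) where p is the probability that A plays I and
    q is the probability that B plays I."""
    denom = Fraction(a - b - c + d)
    p = Fraction(d - c) / denom
    q = Fraction(d - b) / denom
    value = Fraction(a * d - b * c) / denom
    return p, q, value

# Penny matching: each side should choose at random with probability 1/2.
print(solve_2x2(1, -1, -1, 1))

# A hypothetical matrix containing the payoffs 4 and -2 from the text.
print(solve_2x2(1, -3, -2, 4))
```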


Program 17.5  CARDPAK

As a fairly complicated example of a game-theoretic approach,
we will present a program which will play an optimal game of
poker. Prior to presenting the game we will establish certain
utility functions which may be useful not only in other forms
of poker but perhaps in other card games as well.


An important initial consideration is the choice of data
representation. How should a card be represented? In SNOBOL4,
with its wealth of string operations, a natural choice is a
single character. We will represent the 52 cards of the deck
by the letters of the alphabet:


     'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'

The assumed ordering is:

     (2C 3C ... AC) (2D 3D ... AD) (2H 3H ... AH) (2S 3S ... AS)

In principle, any 52 characters could have been used such as
the first 52 characters of &ALPHABET. In practice, debugging
is easier if one uses printable characters.
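
The one-character-per-card idea carries over to any language with a character translation primitive. The Python sketch below mirrors the constants of the listing that follows (lower-case letters first, as in FULL_DECK there); `str.translate` stands in for SNOBOL4's REPLACE, and the function name `display` is only illustrative.

```python
import string

# 52 one-character cards: clubs, diamonds, hearts, spades, 2 up to ace.
FULL_DECK = string.ascii_lowercase + string.ascii_uppercase
JUST_VALS = '23456789TJQKA' * 4
JUST_SUITS = 'C' * 13 + 'D' * 13 + 'H' * 13 + 'S' * 13

TO_VALS = str.maketrans(FULL_DECK, JUST_VALS)
TO_SUITS = str.maketrans(FULL_DECK, JUST_SUITS)

def display(hand):
    """Render a hand such as 'amNZ' in the conventional notation."""
    vals = hand.translate(TO_VALS)
    suits = hand.translate(TO_SUITS)
    return ' '.join(('10' if v == 'T' else v) + s
                    for v, s in zip(vals, suits))

print(display('amNZ'))
```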
          DEFINE('RHAND(K,FLAG)')
          DEFINE('SUITS(H)')
          DEFINE('VALS(H)')
          DEFINE('DISPLAY(H)VALS,SUITS,V,S')
*
*  Initialization of constant strings.
*
          FULL_DECK =
+            'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
          ALL_VALS = 'ABCDEFGHIJKLM'
          JUST_VALS = DUPL(ALL_VALS,4)
          JUST_SUITS = DUPL('C',13) DUPL('D',13) DUPL('H',13)
+            DUPL('S',13)
                                                  :(CARDPAK_END)
*
*  RHAND(K,FLAG) will return a random hand with K cards in
*  it.  If FLAG is nonnull, the deck will be reshuffled.  If
*  an insufficient number of cards remain, RHAND will fail.
*
RHAND     RANDOM_DECK = DIFFER(FLAG) RPERMUTE(FULL_DECK)
          RANDOM_DECK LEN(K) . RHAND =            :F(FRETURN)S(RETURN)
*
*  SUITS(H) will return just the suits for the hand H.
*
SUITS     SUITS = REPLACE(H,FULL_DECK,JUST_SUITS) :(RETURN)
*
*  VALS(H) will return just the values of the hand H.
*
VALS      VALS = REPLACE(H,FULL_DECK,JUST_VALS)   :(RETURN)
*
*  DISPLAY(H) will return a string representing the hand H in
*  a form consistent with conventional representations.
*
DISPLAY   VALS = REPLACE(VALS(H),ALL_VALS,'23456789TJQKA')
          SUITS = SUITS(H)
DISPLAY_1 VALS LEN(1) . V =                       :F(RETURN)
          V = IDENT(V,'T') 10
          SUITS LEN(1) . S =
          DISPLAY = DISPLAY V S ' '               :(DISPLAY_1)
CARDPAK_END

Names referenced   Name       Type       Where defined
by CARDPAK:        RPERMUTE   Function   Program 16.3
                   ORDER      Function   Program 3.1


Program 17.6  POKEV

As a prelude to finding an optimal strategy of a game of poker
we will write a function POKEV(HAND) which will evaluate a
poker hand (5 cards) producing a number (very nearly)
uniformly distributed in the range (0,1) and monotonically
increasing with the strength of the hand. Thus, hand H1 is
stronger than H2 if POKEV(H1) > POKEV(H2). The constraint that
the numbers be uniformly distributed is very important to the
successful operation of the optimal POKER-playing program.
That is, the percentage of times that a hand H will be such
that POKEV(H) < X must be X or close to it. This is perhaps
the trickiest part of the program.


To begin with we find, via pattern matching, which of the
several categories the hand falls into, e.g. bust, pair, two-
pair, three-of-a-kind (trips), etc. We set an array (POKEV_A)
to contain probabilities that such hands are dealt. The
probabilities can be computed or looked up in a source such as
Epstein [1967]. We then need to resolve the question of where
a given hand falls with respect to all other hands in its 
category (the variable FRACTION). This may be done crudely by 
regarding the values of the hand, sorted in descending order, 
as a number in a base-13 radix system. Unfortunately (as the 
author learned by experience) the result is too inaccurate to 
lead to optimal play. Consider for example, bust hands. Few 
hands would have a lead value of 10 or less and no hands would 
have a lead value of 6 or less. Hence no hands would evaluate 
to .15 or less, a severe distortion. 
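
The cumulative probabilities mentioned above can be recomputed from the standard counts of five-card hands. The Python sketch below does so; the counts are the well-known combinatorial values for a 52-card deck, and the ordering of categories (bust lowest, straight flush highest) is the usual poker ranking.

```python
from math import comb

# Number of distinct 5-card hands in each category, weakest first.
COUNTS = [
    ('bust',           1302540),
    ('pair',           1098240),
    ('two pair',        123552),
    ('trips',            54912),
    ('straight',         10200),
    ('flush',             5108),
    ('full house',        3744),
    ('fours',              624),
    ('straight flush',      40),
]

TOTAL = comb(52, 5)

def cumulative():
    """Probability that a hand falls in the given category or lower."""
    running, table = 0, []
    for name, count in COUNTS:
        running += count
        table.append((name, running / TOTAL))
    return table

for name, p in cumulative():
    print(f'{name:15s} {p:.6f}')
```

The printed column can be compared against the constants assigned to POKEV_A in the listing.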


A solution is to consider the hand as representing a number in
the combinatorial number system (see DECOMB, Prog. 15.2). This
system has the property that the digits descend, just as
required. Were it not for straights, the representation for
bust hands would be exact.
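
To make the idea concrete, here is a Python sketch of ranking a set of strictly descending digits in the combinatorial number system. It is an assumption that DECOMB (Program 15.2) computes this same quantity; the function names below are illustrative only.

```python
from math import comb

def comb_rank(digits):
    """Rank of strictly descending digits c1 > c2 > ... > ck in the
    combinatorial number system: C(c1,k) + C(c2,k-1) + ... + C(ck,1)."""
    k = len(digits)
    return sum(comb(d, k - i) for i, d in enumerate(digits))

def fraction(digits, base=13):
    """Position of the hand among all k-subsets of `base` values, in [0,1)."""
    return comb_rank(digits) / comb(base, len(digits))

# Card values run from 0 (deuce) to 12 (ace).
print(comb_rank([12, 11, 10, 9, 8]), comb_rank([4, 3, 2, 1, 0]))
print(fraction([12, 9, 7, 3, 1]))
```

Because every 5-subset of the 13 values receives a distinct rank from 0 to C(13,5) - 1, the fraction is spread evenly over the interval, which is the property the base-13 reading lacks.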


For hands such as pairs, trips, two-pairs, fours, and full- 
houses we take the most significant designator (one or two 
cards) as a base-13 number and combine this with the remaining 
cards in a mixed residue fashion to obtain a final evaluation. 


          DEFINE('POKEV(H)VALS,SUITS,V,W')
*
*  Define patterns to detect major poker categories
*
          STRAIGHT_SEQ = REVERSE(ALL_VALS) SUBSTR(ALL_VALS,13,1)
          PAIR.V = LEN(1) $ V *V
          TRIPS.V = PAIR.V *V
          FOURS.V = TRIPS.V *V
          FLUSH.V = FOURS.V *V
*
*  The following array gives the probability that a hand will
*  fall within or lower than the indicated level.  0 is a
*  bust, 1 is a pair, etc.
*
          POKEV_A = ARRAY('-1:8')
          POKEV_A<0> = 0.501
          POKEV_A<1> = 0.924
          POKEV_A<2> = 0.971
          POKEV_A<3> = 0.9924
          POKEV_A<4> = 0.9963
          POKEV_A<5> = 0.9983
          POKEV_A<6> = 0.99974
          POKEV_A<7> = 0.999985
          POKEV_A<8> = 1.0
*
*  PR(L,PREFIX) is a utility function used by POKEV to com-
*  pute the actual evaluation of the poker hand, assign it to
*  POKEV and return.  L is the level of the hand as in the
*  above array.  PREFIX is the secondary evaluation parameter
*  and consists of zero, one or two cards (e.g., the 6 of
*  trip 6's).  For further resolution, the variable VALS con-
*  tains the rest of the values in order of significance.
*  These are regarded as a combinatorial representation of
*  some number.
*
          DEFINE('PR(L,PREFIX)COMBS,FRACTION,A')  :(POKEV_END)
PR        A = POKEV_A
          COMBS = COMB(13,SIZE(VALS))
          BASEB_ALPHA = ALL_VALS
          COMB_ALPHA = ALL_VALS
          FRACTION = (BASE10(PREFIX,13) * COMBS + DECOMB(VALS))
+            / (13. ** SIZE(PREFIX) * COMBS)
          POKEV = A<L - 1> + (A<L> - A<L - 1>) * FRACTION
          PR = .RETURN                            :(NRETURN)
*
*  Entry point for POKEV.  Thanks to PR, our job reduces to a
*  simple matter of pattern matching.
*
POKEV     VALS = REVERSE(ORDER(VALS(H)))
          SUITS = SUITS(H)
          STRAIGHT_SEQ VALS | ROTATER(VALS,-1)    :F(POKEV_3)
          SUITS FLUSH.V                           :S(PR(8))F(PR(4))
POKEV_3   SUITS FLUSH.V                           :S(PR(5))
          VALS PAIR.V                             :F(PR(0))
          VALS FOURS.V =                          :S(PR(7,V))
          VALS TRIPS.V =                          :F(POKEV_5)
          W = V
          VALS PAIR.V =                           :S(PR(6,W V))F(PR(3,W))
POKEV_5   VALS PAIR.V =
          W = V
          VALS PAIR.V =                           :S(PR(2,W V))F(PR(1,W))
POKEV_END

Names referenced   Name      Type       Where defined
by POKEV:          ORDER     Function   Program 3.1
                   ROTATER   Function   Program 3.5
                   REVERSE   Function   Program 3.6
                   COMB      Function   Program 15.1
                   BASE10    Function   Program 2.5
                   CARDPAK   Package    Program 17.5
                   DECOMB    Function   Program 15.2


Program 17.7

As the reader may be aware, there are many forms of the game
of poker: Draw, Stud (5 and 7 cards), Baseball, Blind, etc.
There may be wild cards and there may be any number of
players. We will pick the simplest game, viz. cold-hand
five-card poker between two players with nothing wild. This
choice is dictated by the simple fact that it is the only
poker game that has been fully analyzed [Cutler 1975] and for
which an optimal strategy exists. The reader may obtain
additional references to the analysis of this game from
Cutler's paper or from a cited bibliography, Findler [1972].


In cold-hand poker, each player enters an ante into the pot
and is dealt a hand (best thought of as a number in the range
(0,1)) and the players take turns betting, checking, calling,
raising and folding. Briefly, checking and betting are done
when the pot contains equal contributions from both players
(such as at the start or after a check). Calling, raising and
folding are done when it is up to one of the players to
equalize the pot. If he does not, he folds, forfeiting his
right to the pot. If he calls, there is a showdown. A raise
is a call followed by a bet. The set of possibilities is
shown in Figure 17.5 where the first player is designated X
and the second is Y. Note that Check-raises are not permitted.

Figure 17.5


The allowable bet sequences of cold-hand poker.


In the game given by Cutler, the value for all bets is the
current value of the pot. The value of a raise is found by
decomposing the raise into a call followed by a bet. We will
extend the game somewhat by allowing the player to set the
value of the bet (before-hand) to any fraction of the pot.
Whereas all poker games require some limit, most games do
permit players to bet any amount up to this limit. It has been
conjectured that any bet short of the limit is suboptimal so
that it might be reasonable to allow the player to make
submaximal bets. But then the strategy, particularly when to
fold, would have to be changed.


The derivation of the optimal strategy is beyond the scope of
the current discussion. To obtain a flavor for the analysis,
consider only the case where the first player, X, may check or
bet and the second player, Y, either calls or folds. Since
Y's move ends the game, he has nothing to conceal from X and
so he plays a pure strategy of calling on all good hands
(anything above a certain value called the call line) and
folding on poor hands (anything else). Now consider X's
situation. On very strong hands, X has nothing to lose by
betting. On his average hands he has very much to lose if he
bets since he would have to square off against Y's better
hands. On the other hand, if he has an absolutely rotten hand,
his only hope of winning is to bluff Y. Though he stands to
lose more if caught bluffing, his expectation, it can be
shown, is larger than if he stood the certain loss of a
showdown with Y. The pattern of this simple situation holds in
all the more complex cases, viz. a bet on all hands above a
certain level and a bluff on all hands below a certain level.
Also the bluff must be in a fixed ratio R of the percentage of
legitimate bets where R depends on the bet limit.


We list here, for convenience, various parameters used by the
poker program.


L = the bet limit as a percentage of the pot.

R = the bluff ratio (L / (1 + L)).

A = the initial betting line for player X. X bets on hands
    greater than this. He checks on hands worse, except that
    on his lowest (1 - A) * R hands he bluffs.

B = the call line for player X after the sequence Check-Bet.
    Below this line he folds. He has no other options. See
    Figure 17.5.

C = the betting line for player Y after X checks. Below this
    line, player Y calls except for the lower R * (1 - C)
    hands which he bluffs.

D = the call line for player Y after X bets. Above this
    line, he will call (except for the very good hands which
    he bets) and below this level he will fold (except for
    the bluffs).
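
As a rough illustration of how the lines A and R partition X's hands on his opening move, here is a minimal sketch in Python (not SNOBOL4). It assumes hands are scored on a scale from 0 to 1 with higher being better, as POKEV does; the names bluff_ratio and x_first_move and the sample value A = 0.875 are ours and purely illustrative.

```python
def bluff_ratio(limit):
    """R = L / (1 + L), with L the bet limit as a fraction of the pot."""
    return limit / (1.0 + limit)


def x_first_move(hand, a, limit):
    """Classify player X's opening move as 'bet', 'bluff' or 'check'.

    hand  -- hand value between 0 and 1 (higher is better)
    a     -- the betting line A
    limit -- the bet limit L as a fraction of the pot
    """
    r = bluff_ratio(limit)
    if hand > a:
        return "bet"                # legitimate bet above the line A
    if hand < (1.0 - a) * r:
        return "bluff"              # lowest (1 - A) * R hands
    return "check"                  # everything in between


if __name__ == "__main__":
    for hand in (0.95, 0.50, 0.05):
        print(hand, x_first_move(hand, 0.875, 1.0))
```

With a pot-sized limit (L = 1) the bluff ratio is 1/2, so X bluffs half as often as he bets legitimately.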


The astute reader will note that the game can go on
indefinitely whereas we have provided parameters for only a
finite number of situations. The parameters ALPHA and BETA
below serve to bridge the gap between the finite and the
infinite as they provide rules for extrapolating out to the
nth raise.


ALPHA = the raise attenuation factor. Given that the
        opponent's best strategy is to raise with his best P
        hands, then our best strategy is to respond by raising
        on our best P * ALPHA hands. Note that the raise
        attenuation factor for a round trip is ALPHA ** 2 and
        this factor is actually used in the program.


BETA  = the lion factor. Given that my optimal strategy is to
        bet (or raise) in the upper P hands, then, if my
        opponent responds by raising, I will fold below the
        BETA * P line (unless I'm bluffing). (1 - BETA is
        sometimes called the chicken factor.)
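
To see how the attenuation factor extends a finite set of parameters to an arbitrarily long raising sequence, the following Python sketch (not SNOBOL4) tabulates the machine's raising fraction over successive round trips. The formula for ALPHA is our reading of the ALPHA function in the listing that follows; the starting fraction (1 - A) * ALPHA and the per-round-trip factor ALPHA * ALPHA mirror the statements in POKER. The function names and the sample value of A are assumptions made for illustration.

```python
import math


def alpha(limit):
    """Raise attenuation factor for bet limit L (fraction of the pot)."""
    t = 1.0 + 2.0 * limit
    return (-(t + 1.0) + math.sqrt(t * t + 6.0 * t + 1.0)) / (2.0 * t)


def raise_fractions(a, limit, round_trips):
    """Fraction of best hands raised on each successive round trip."""
    al = alpha(limit)
    fraction = (1.0 - a) * al       # initial value of RAISE
    result = []
    for _ in range(round_trips):
        result.append(fraction)
        fraction *= al * al         # one round trip attenuates by ALPHA ** 2
    return result


if __name__ == "__main__":
    for n, f in enumerate(raise_fractions(0.875, 1.0, 4)):
        print(n, round(f, 6))
```

Because ALPHA lies strictly between 0 and 1, the fractions shrink geometrically, which is why most raising sequences end quickly.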


+--------------------------------------------------------------+
| The function ABCDR(L) will set the global variables A, B,    |
| C, D and R as well as the parameters ALPHA and BETA. It      |
| is assisted in this by the functions ALPHA(L) and BETA(L)    |
| which compute ALPHA and BETA respectively.                   |
+--------------------------------------------------------------+

          DEFINE('ABCDR(L)THETA,PHI,TAU,TTR')
          DEFINE('ALPHA(L)T')
          DEFINE('BETA(L)T')                      :(ABCDR_END)
+--------------------------------------------------------------+
| Entry point for ALPHA:                                       |
+--------------------------------------------------------------+

ALPHA     T = 1 + 2 * L
          ALPHA = -(T + 1) + SQRT(T ** 2 + 6 * T + 1)
          ALPHA = ALPHA / (2 * T)                 :(RETURN)


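+--------------------------------------------------------------+
| Entry point for BETA:                                        |
+--------------------------------------------------------------+

BETA      T = 1 + 2 * L
          BETA = -(T ** 2) + 2 * T + 1 + (T - 1) *
+            SQRT(T ** 2 + 6 * T + 1)
          BETA = BETA / (2 * T ** 2)              :(RETURN)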


+--------------------------------------------------------------+
| Entry point for ABCDR:                                       |
+--------------------------------------------------------------+

ABCDR     ALPHA = ALPHA(L)
          BETA = BETA(L)
          PHI = L / (1 + 2 * L)
          THETA = 1 - PHI
          TAU = 1 + 2 * L
          R = L / (1 + L)
          TTR = TAU * THETA / R
          A = -1 + 2 * PHI + ALPHA + TTR * (4 * PHI + 2 * ALPHA)
          A = A / (TAU * THETA + ALPHA + TTR * (2 * ALPHA + 1))
          B = 4 * PHI + 2 * ALPHA - (2 * ALPHA + 1) * A
          C = 2 * PHI + ALPHA - A * ALPHA
          D = R * (1 + ALPHA) - R * ALPHA         :(RETURN)
ABCDR_END


+--------------------------------------------------------------+
| BET() will compute the amount which can be bet with a        |
| given limit L.                                               |
+--------------------------------------------------------------+

          DEFINE('BET()')                         :(BET_END)
BET       BET = CONVERT(POT * L,'INTEGER')
          BET = GT(BET,HIM) HIM
          GT(BET,0)                               :S(RETURN)F(FRETURN)
BET_END

+--------------------------------------------------------------+
| Now for the POKER program proper. Given the mnemonic         |
| labels, the use of QUEST, and the discussion in the text,    |
| comments are virtually unnecessary. The request for the      |
| lucky number is simply a device to warm up our random        |
| number generator so that identical hands will not always     |
| be dealt.                                                    |
+--------------------------------------------------------------+


          OUTPUT = 'Welcome to Cold-hand Poker'
          QUEST('Would you like to know the rules? /'
+            '(YES)|(NO)INIT')                    :S($LABEL)
PLOOP     OUTPUT = INPUT                          :S(PLOOP)
INIT
          QUEST('What is your lucky number today?/RAN_VAR(1...1000)')
          HIM = RANDOM(100) + 20
          OUTPUT = "We'll start you off with " HIM " chips"
NEWP      QUEST('Bet limit (% of pot) = /L(10...1000)')
          L = L / 100.
          ABCDR(L)
ANTE      QUEST("What's the ante? /ANTE(1...HIM)")
START     GT(ANTE,HIM)                            :S(ANTE)
          POT = 2 * ANTE
          HIM = HIM - ANTE
          OUTPUT = 'With a ' ANTE ' chip ante the pot has '
+            POT ' chips'
          HX = RHAND(5,1)
          X = POKEV(HX)
          HY = RHAND(5)
          Y = POKEV(HY)
          OUTPUT = 'You are dealt ' DISPLAY(HX)
          RAISE = (1 - A) * ALPHA
          CALL = 1 - D
          QUEST('Would you like to bet(B) or check(-)? /'
+            '(B)HE_BETS|(-)HE_CHECKS')           :S($LABEL)
HE_CHECKS OUTPUT = LETMESEE()
          (LE((1 - C) * R,Y) LT(Y,C))             :S(I_CHECK)
I_BET     BET = BET()                             :F(CANT_BET)
          POT = POT + BET
          OUTPUT = "I guess I'll bet " BET " chips."
          QUEST('How about you, call(C) or fold(F)? /'
+            '(C)|(F)I_WIN')                      :S($LABEL)
HE_CALLS  POT = POT + BET
          HIM = HIM - BET                         :(COMPARE)
I_CHECK   OUTPUT = "I'll check too"               :(COMPARE)
HE_BETS   BET = BET()                             :F(CANT_BET)
          POT = POT + BET
          HIM = HIM - BET
          OUTPUT = 'You bet ' BET ' chips.'
          OUTPUT = LETMESEE()
          GT(Y,1. - RAISE)                        :S(I_RAISE)
          GT(Y,1 - CALL)                          :S(I_CALL)
          LT(Y,R * RAISE)                         :S(I_RAISE)F(I_FOLD)
I_RAISE   OUTPUT = "I'll see your " BET " chips"
          POT = POT + BET
          BET = RBET()                            :F(CANT_BET)
          OUTPUT = " and raise you " BET
          POT = POT + BET
          QUEST('You must now raise(R), call(C) or fold(F) /'
+            '(R)|(C)HE_CALLS|(F)I_WIN')          :S($LABEL)
HE_RAISES OUTPUT = 'You call my ' BET ' chips and'
          HIM = HIM - BET
          POT = POT + BET
          CALL = RAISE * BETA
          RAISE = RAISE * ALPHA * ALPHA           :(HE_BETS)
I_CALL    OUTPUT = 'OK, I call'
          POT = POT + BET                         :(COMPARE)
CANT_BET  OUTPUT = 'Since you have no money left we '
+            'have to stop here'
COMPARE   OUTPUT = "Let's see, I have " DISPLAY(HY)
          GT(X,Y)                                 :S(HE_WINS)
I_WIN     OUTPUT = 'I guess I take all ' POT ' chips in the pot'
          OUTPUT = INSULT()                       :(SUMMARY)
I_FOLD    OUTPUT = 'I fold'
HE_WINS   HIM = HIM + POT
          OUTPUT = 'You win the ' POT ' chips in the pot'
          OUTPUT = PRAISE()                       :(SUMMARY)
SUMMARY   OUTPUT = 'You now have ' HIM ' chips'
          OUTPUT = EQ(HIM,0) 'So Long'            :S(END)
          QUEST('Same game (S) or new parameters (N)? /'
+            '(S)START|(N)NEWP')                  :($LABEL)
END

Names referenced    Name       Type        Where defined
by POKER:           QUEST      Function    Program 17.2
                    SQRT       Function    Program 15.6
                    POKEV      Function    Program 17.6
                    CARDPAK    Package     Program 17.5

Epilogue 


The following session was actually obtained using the above 
poker program. As usual, underscored items indicate responses 
by the machine. 


----------------------------------------------------------------

Would you like to know the rules? nope
The correct form is YES|NO
Would you like to know the rules? NO
What is your lucky number today? 177
Bet Limit (% of pot) = 100
What's the ante? 10
With a 10 chip ante the pot has 20 chips
You are dealt 7D 4C 8D 6D AD
I need time to meditate about this problem
I'll check too
Let's see, I have 10D 9S 2D QC JD
You win the 20 chips in the pot
Thank you very much for the game, I enjoyed your brilliant
effort
Same game (S) or new parameters (N)? S
With a 10 chip ante the pot has 20 chips
You are dealt 9D 6D JD 5S 2H
Would you like to bet(B) or check(-)? B
You bet 20 chips

----------------------------------------------------------------


Not all games are this brief. With lower betting limits,
optimal play calls for generally more betting. The most complex
bidding sequence resulted with a bet limit of 10% of the pot.
The player was dealt two-pair and bet ruthlessly. The machine
also bet heavily, raising three times before calling. The
machine had a full house. In general, however, the machine is
very conservative and most bidding sequences are quite short.


The use of the ‘lucky number' ruse to initialize the random 
number generator is common but entirely unnecessary if one has 
the time-of-day available to him. The time of day is actually 
available in many SNOBOL's, though not in the original. 


Though the reader may be expected to understand most of the 
routines in this book, the equations used in the function 
ABCDR to compute these parameters are probably not in this 
category. At this writing, this is their only appearance in 
print. 


Exercise 17.1   Assume a machine and a player would like to
play cards. If the player shuffles and deals, the machine may
be cheated. If the machine randomly generates hands, the
player could be cheated. How can a one-way cipher be used to
ensure a fair deal?


Exercise 17.2   Assume one had a program to play penny-
matching such that the program attempted to find patterns in
the play of the opponent. Assume that there were no
randomizing component in the program but that it was strictly
deterministic. Is there a strategy which will beat such a
program?


Exercise 17.3   Categorize and describe the decision graph
for the following game. Player A places $10 in the pot and
player B places $1 in the pot. First it is player A's turn and
he can bet $1 whereupon B must call or fold. If B folds, A
takes the pot. If he calls, he matches A's $1 and it remains
A's turn. The procedure continues until A chooses not to bet
whereupon they roll a die. 1 or 2 is victory for B; 3, 4, 5 or
6 is victory for A.


Exercise 17.4   Write a function PHRASE(LIST) where LIST is
a list of names separated by commas which will, for each name
NM in the list, (1) define a function by that name and (2)
compile code so that the function returns RSENTENCE('<NM>').
In this way, for example,


          PHRASE('INSULT,PRAISE,LETMESEE')


could take the place of the function definitions given in
Prog. 17.1.


Exercise 17.5   Some variables cannot be used in a QUEST
descriptor (Prog. 17.2). Give a simple rule to prospective
QUEST users so that they may avoid any difficulties. How
would you modify QUEST so that a diagnostic can be given?


Exercise 17.6   One of the reasons that QUEST was written
with a separate utility function QUESTP was so that it could
be easily modified to handle extensions of the following kind.
Extend QUEST so that several arguments may be supplied
separated by commas. QUEST patterns are then any combination
of QUEST descriptors joined by the operators comma (,) and
alternation (|) with comma having higher precedence. Also
allow parentheses in such expressions.


Exercise 17.7   Extend QUEST so that it accepts, in addition
to number ranges, letter ranges of the form (C1-C2) where C1
and C2 are single characters.



Exercise 17.8   The game of NIM is such that there are four
piles of 1, 3, 5, and 7 stones. Each player may take any
number, including all, of any one pile. He must take at least
one stone, however. The person forced to remove the last stone
loses. There is an optimal strategy for NIM which guarantees a
win for the first player which is based on converting the
numbers to binary and exclusive-ORing on a digit-by-digit
basis. There are also optimal strategies if the game is
extended to selecting from any K piles; one then uses a K+1
system; see Ball [1962].


But the game can easily be perturbed so that the optimal
strategies can't be used. Examples include placing a limit on
the number of stones or requiring that an even number be
followed by an odd. Of course, such rule changes do not
invalidate a decision graph approach. For these reasons, if
not for the sheer joy of doing so, write a function NDA(S)
which will prepare and return a NIM decision array. S will be
a string of initial-pile numbers such as '1,3,5,7'. Assume
the one-pile no-limit restriction on betting.


Exercise 17.9   Modify the function SDA (of STONE (Prog.
17.3)) so that the variable MAX designates a list of possible
moves separated by commas. For example, MAX = '1,3,5' means
that 1, 3 or 5 stones may be selected.


Exercise 17.10   Amaze your friends with this one. Modify
STONE so that the player can insert, in place of the parity, a
predicate P(N) which will determine whether or not the player
(opposing the machine) wins. Thus:


          EQ(REMDR(N,2))


as the predicate P(N) indicates that the player will win if he
has an even number of stones. Also


          (GE(N,5) LE(N,10))


indicates that the player will win if his total is within the
range (5,10).


Exercise 17.11   How many symmetries are there to the 4X4X4
Tick-tack-toe game (i.e. classic 3-D Tick-tack-toe)? How about
a 3X3X3 board?


Exercise 17.12   Modify TTTM and TTTV and rewrite NEXTBD
for the following game. The board is 3X3X3, moves are like
Tick-tack-toe and a winning pattern is:


          X |   | X


on any of the 6 sides or in any of the 3 slices parallel to a
side through the middle or in any of the 6 slices through the
diagonal.


Exercise 17.13   Consider a three-dimensional cube, 3X3X3,
with one corner subcube removed leaving exactly 26 subcubes.
How many symmetries of this cube are there?



Exercise 17.14   With the help of QUEST and a nice board-
printout function, complete the Tick-tack-toe game (Prog.
17.4).


Exercise 17.15   One way of speeding up TICTACTOE is to not
look further when a move is found which results in victory.
Implement this. (Hint: it requires adding one instruction to
TTTM.)


Exercise 17.16   To play 3D Tick-tack-toe on a 4X4X4 board,
one needs to limit somehow the depth of search. If the depth
of search is limited, one needs a heuristic for evaluating a
board. Use the following scheme. Assume that it is X's move.
For every X find the lines passing through it not already
blocked by an O. If it stands by itself in a line add 1. If it
stands with another add 3. If it stands with two others, add
10000 or some other such large number as this would imply
victory. Do a similar evaluation for O and subtract the two
amounts. Modify TTTV to use this evaluation whenever the
global variable FNCLEVEL reaches the value of the keyword
&FNCLEVEL. The global variable is of course set by the main
program.


Exercise 17.17   Let H be a hand of cards as in CARDPAK.
Suppose we wish to sort the cards in the order of increasing
value (ignoring suits). How could the function ORDER be
modified to accomplish this?


Exercise 17.18   Modify the CARDPAK functions so that they
are operative with a pinochle deck (48 cards, Ace-9 (twice) of
each suit).


Exercise 17.19   A bridge hand is evaluated for high-card
points by assigning 4, 3, 2, 1 points respectively to the A,
K, Q, J. In two statements, randomly shuffle and deal a hand,
and determine and print its value. You may use COUNT (Prog.
3.4).


Exercise 17.20   Modify POKEV (Prog. 17.6) so that it
evaluates a three-card poker hand. Note that straights and
flushes do not count extra but that a straight-flush counts
higher than either a pair or trips. Use the values 0.83,
0.955, and 0.978 as the probabilities of getting a bust, a
pair or lower, and three-of-a-kind or lower respectively.


Exercise 17.21   If we were playing with three decks, so
that duplicate and triplicate cards could actually be obtained
in a single hand, POKEV would no longer be monotonic. Why? How
would you modify POKEV so that it would work with any number
of decks?


Exercise 17.22   Write a function POKUNVAL which will be an
approximate inverse of POKEV. That is, given a real number X
in the range (0,1), POKEV(POKUNVAL(X)) should approximate X.


Exercise 17.23   POKEV is not especially uniform over the
range of hands categorized as two-pairs. Fix up POKEV so that
it regards (W V) as a number in a combinatorial number system
rather than in a radix system.


Exercise 17.24   Assuming that both players are playing
optimally, label the branches of the flowchart for cold-hand
poker (Figure 17.5) with comparisons of the values of their
hands against expressions involving the parameters A, B, C,
etc. Modify POKER so that it plays an optimal game for X,
rather than Y.


Exercise 17.25   If we were not concerned with losing
optimal behavior, we could, by adding just one statement to
POKER (Prog. 17.7), permit the player to bet any amount up to
the maximum allowed. Give an example of such a statement and
indicate where it should be placed.


The development of the stored-program machine is thought to be
of importance because it allows a program to modify itself.
Today, index registers obviate the necessity for a program to
be self-modifying so that the practice is not only considered
nonimportant (witness the growth of pure procedure) but is
considered harmful as an obscuring practice. The real and
lasting significance of stored program is that it allows
programs to produce other programs (if most machines still had
plug-board control, the output of a 'compiler' would have to
be a wired-up plug-board or a wiring diagram and a congenial
and dextrous computation staff).



It is therefore no coincidence that assemblers began appearing 
at about the time of the first installations of stored-program 
machines (circa 1950) and compilers (originally called 
automatic coders) and interpreters began to be developed 
shortly thereafter. This marked for the first time in the 
history of mankind the development of artificial languages; 
languages which would be literally and unfailingly obeyed by a 
mechanical servant; languages whose constructs and
convolutions are subject only to the requirement that a translation
algorithm be written for the language. Alas, this turns out 
to be one of the major obstacles to creating languages which 
are powerful and congenial, since it is no simple task to 
describe how to convert an arbitrary language into efficient 
code. This not only makes it difficult to implement large 
languages efficiently, but also makes it difficult to formally 
describe a large language. 


This chapter is devoted primarily to the task of describing 
how language translators of one kind or another can be written 
using the SNOBOL4 language. Compiling and assembling are 
primarily string processing activities and so it is not
surprising that SNOBOL4 should be particularly helpful along these
lines. But actually it is by no means obvious how to employ
the powerful pattern matching operations to parse languages. 
In fact, Griswold [1974, p. 11] says that "patterns derived 
from grammars are of little use in such [i.e., parsing] 
problems." We will show, on the contrary, that we can almost 
directly map a formal grammar into a parsing pattern and that 
SNOBOL4 patterns are particularly applicable to the parsing
task. 


Traditionally, SNOBOL processors have had a tendency to be big
and slow and for this reason applications have tended to hover
about the periphery of linguistic translation in such chores
as bootstrapping, pre-processing, macro pre-passes and in
general software which has a small user population and high
development costs. But the more recent implementations of
SNOBOL4 (viz. SPITBOL, SITBOL and FASBOL) have greatly
extended the practical application of SNOBOL4 while the great
proliferation of languages and machines has extended the need
for such applications. Also, SNOBOL4 has often been used to
teach compiler-writing because it simplifies the task
sufficiently to allow the student to complete a compiler in a
term. By using SNOBOL4 many of the by-now routine tasks of
lexical and syntactic analysis are quite easily accomplished
permitting attention to be focused on more difficult aspects
of the translation task.


    Machine M is a word-addressable machine with 32 bits per
    word. All instructions have the format:

              +---------+------+-------+--------+
              | OP-code |  AC  |   X   |   A    |
              +---------+------+-------+--------+
        Bits      0-7     8-11   12-15   16-31

    There are sixteen general purpose registers which can
    serve both as accumulators for arithmetic and as index
    registers for address modification. The AC (accumulator)
    and X (index register) fields are four bits for the
    purpose of specifying one of these sixteen registers. The
    maximum number of words for the machine is 2**16 so that
    the A (address) field can specify absolutely any address
    in the machine. The effective address, E, for any
    instruction is the sum of the index register (X) plus the
    value of the A field. We will refer to the contents of
    location E as C(E). If E is less than 16, a register is
    the assumed location. If the X field is 0, no indexing is
    assumed. Thus, Reg. 0 cannot be used as an index register.
    In the description of OP-codes which follow, AC will refer
    to the accumulator referenced by the AC field.

    Mnemonic   Code    Instruction
               (Hex)

    LOAD       21      Load C(E) into AC
    STORE      22      Store AC into location E
    ADD        31      Integer add C(E) to AC
    SUB        32      Integer subtract C(E) from AC
    MUL        33      Multiply C(E) to AC (Overflow lost)
    DIV        34      Integer divide C(E) into AC
    FADD       71      Floating add C(E) to AC
    FSUB       72      Floating subtract C(E) from AC
    FMUL       73      Floating multiply C(E) to AC
    FDIV       74      Floating divide C(E) into AC
    LOADA      2A      Load effective address E into AC
    LOADN      2F      Load -C(E) into AC
    BR         A0      Branch to location E
    BRGT       A1      Branch to E if AC is > 0
    BRLT       A2      Branch to E if AC is < 0
    BREQ       A3      Branch to E if AC is = 0
    BRNE       A4      Branch to E if AC is ≠ 0
    BRGE       A5      Branch to E if AC is ≥ 0
    BRLE       A6      Branch to E if AC is ≤ 0


                          Figure 18.1

                 A description of machine M.


Since we will be involved in this chapter with assembling and 
compiling it will be helpful to fix on a particular machine. 
The machine whose instruction set is described in Figure 18.1 
will be referred to as machine M. It will be used as an exam- 
ple machine throughout. 
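
To make the instruction format of machine M concrete, here is a small Python sketch (not SNOBOL4) that packs and unpacks the four fields of a word and computes an effective address. It assumes that bit 0 is the most significant bit, so that the OP-code occupies the leftmost two hexadecimal digits; the function names are ours, and no provision is made for address overflow, which Figure 18.1 does not discuss.

```python
def encode(op, ac, x, a):
    """Pack OP-code, AC, X and A into one 32-bit instruction word."""
    if not (0 <= op < 1 << 8 and 0 <= ac < 1 << 4
            and 0 <= x < 1 << 4 and 0 <= a < 1 << 16):
        raise ValueError("field out of range")
    return (op << 24) | (ac << 20) | (x << 16) | a


def decode(word):
    """Return the tuple (op, ac, x, a) for a 32-bit instruction word."""
    return (word >> 24) & 0xFF, (word >> 20) & 0xF, (word >> 16) & 0xF, word & 0xFFFF


def effective_address(word, registers):
    """E = A + contents of register X; an X field of 0 means no indexing."""
    _, _, x, a = decode(word)
    base = registers[x] if x != 0 else 0
    return base + a


if __name__ == "__main__":
    word = encode(0x21, 3, 0, 0x0100)      # LOAD 3,256
    print(format(word, "08X"))
```

The eight hexadecimal digits printed correspond to the one-word-per-card format that the assembler below is expected to punch.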


Program 18.1 - ASM


Program ASM is an assembler for machine M. Each word of the
machine can be represented by 32 bits or 8 hexadecimal digits
or, if &ALPHABET has size 256, 4 characters. We will presume
that our assembler is only required to punch hexadecimal
digits on cards, one word per card. Other output formats are
rather easily obtained using conversions from Chapter 2. Our
assembly language will consist of instructions in the
following format:


Label Op AC,A(X) Comment 


The four fields indicated are separated by blanks. Absence of 
a label is denoted by a blank in column 1. If AC (and/or the 
comma) is missing, 0 is assumed. If the '(X)' is missing, 0 
is assumed. The comment may be missing; if the Op field is 
present, the operand (3rd) field must also be present. If the 
Op field is missing, no instruction is generated; thus labels 
may appear on separate lines. The Op field may contain any 
Mnemonic shown in Figure 18.1. 


Perhaps the most important single observation one can make 
about an assembler is that it is inherently a two-pass system. 
This is because it is impossible to assert a maximum length 
for the sequence: 


          STORE   ALPHA
            .
            .
            .
ALPHA


Hence addresses such as ALPHA are resolved in the first pass 
based on their location; instructions are translated on the 
second pass. 
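
The two-pass idea can be sketched in a few lines of Python (not SNOBOL4). This is only an illustration under simplifying assumptions: a four-entry subset of the op-code table, decimal integers or symbols as operands, and no error reporting. The names split_line and assemble are ours and do not correspond to anything in Program ASM.

```python
import re

# A small subset of the op-codes of Figure 18.1, enough for the example.
OPS = {"LOAD": "21", "STORE": "22", "ADD": "31", "BR": "A0"}

# Operand field: optional "AC,", then A, then optional "(X)".
OPERAND = re.compile(r"(?:(\w+),)?(\w+)(?:\((\w+)\))?")


def split_line(line):
    """Return (label, op, operand); missing fields are empty strings."""
    fields = line.split()
    label = ""
    if line and not line[0].isspace():      # a label starts in column 1
        label, fields = fields[0], fields[1:]
    op = fields[0] if fields else ""
    operand = fields[1] if len(fields) > 1 else ""
    return label, op, operand


def assemble(lines):
    """Translate assembly lines into a list of 8-digit hex words."""
    symbols, loc = {}, 0
    for line in lines:                      # pass 1: resolve labels
        label, op, _ = split_line(line)
        if label:
            symbols[label] = loc
        if op:
            loc += 1

    def value(sym):
        if sym is None:
            return 0                        # missing AC or X defaults to 0
        return int(sym) if sym.isdigit() else symbols[sym]

    words = []
    for line in lines:                      # pass 2: translate instructions
        _, op, operand = split_line(line)
        if not op:
            continue
        ac, a, x = OPERAND.fullmatch(operand).groups()
        words.append("%s%X%X%04X" % (OPS[op], value(ac), value(x), value(a)))
    return words


if __name__ == "__main__":
    program = ["START LOAD 1,ALPHA",
               "      ADD 1,ALPHA(2)",
               "      STORE 1,ALPHA",
               "      BR START",
               "ALPHA"]
    print(assemble(program))
```

Note that the first instruction refers to ALPHA before ALPHA has been seen; only after pass 1 has run over the whole program is its address known.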


The essence of assembling is associative look-up. There are
two distinct reasons for this. It is (by definition) easier
to remember a mnemonic such as 'LOAD' than an op-code such as
'21'. But aside from this it is necessary to have symbols
(such as ALPHA in the above sequence) whose meaning is
resistant to perturbations of the program (such as insertions
or deletions of instructions). The associative lookup is
normally accomplished in most assemblers with the help of some
form of symbol table as described in Chapter 11. In SNOBOL4,
we will use the TABLE datatype to serve this purpose.


+--------------------------------------------------------------+
| This is a simple assembler for the machine M (Figure 18.1).  |
| First we initialize a table (OPS) with the operators and     |
| their codes.                                                 |
+--------------------------------------------------------------+

          LIST = 'LOAD 21,STORE 22,ADD 31,FADD 71,SUB 32,'
+            'FSUB 72,MUL 33,FMUL 73,DIV 34,FDIV 74,LOADA 2A,LOADN 2F,'
+            'BR A0,BRGT A1,BRLT A2,BREQ A3,BRNE A4,BRGE A5,BRLE A6,'
          OPS = TABLE()
OPS_INIT  LIST BREAK(' ') . OP ' ' BREAK(',') . CODE ',' =
+                                                 :F(INIT1)
          OPS<OP> = CODE                          :(OPS_INIT)


+--------------------------------------------------------------+
| Initialization for Pass 1. SYMS is a table to hold user      |
| symbols. LOC is our location counter. We assume I/O unit     |
| no. 10 is available for scratch storage.                     |
+--------------------------------------------------------------+

INIT1     SYMS = TABLE()
          LABEL.L = BREAK(' ') . L SPAN(' ')
          LOC = 0
          OUTPUT(.DISK,10)


+--------------------------------------------------------------+
| Loop for pass 1. Evaluate all symbols.                       |
+--------------------------------------------------------------+

PASS1     X = INPUT ' '                           :F(INIT2)
          DISK = X
          X LABEL.L =
          SYMS<L DIFFER(L)> = BASEB(LOC,16)
          LOC = DIFFER(X) LOC + 1                 :(PASS1)


+--------------------------------------------------------------+
| Initialization for pass 2: set up a big pattern              |
| (P.OP.AC.A.X) to crack fields.                               |
+--------------------------------------------------------------+

INIT2     REWIND(10)
          DETACH(.DISK)
          INPUT(.DISK,10)
          NO_OP = POS(0) BREAK(' ') SPAN(' ') RPOS(0)
          P.OP.AC.A.X = NULL $ OP $ AC $ A $ X  NULL . CAUSE
+            POS(0) BREAK(' ') SPAN(' ')
+            BREAK(' ') . OP SPAN(' ')
+            (BREAK(' ,') . AC ',' | NULL)
+            BREAK('( ') . A
+            ('(' BREAK(')') . X ')' | NULL)



+--------------------------------------------------------------+
| We define a generalized convert-symbol routine (CVTSYM)      |
| which converts a symbol according to a given symbol table    |
| (TABLE) producing a hex string of length LENGTH. TYPE        |
| indicates the type of symbol for diagnostic purposes.        |
| CAUSE is a global error-bearing variable which is printed    |
| on the listing. 'Uf' means undefined symbol in field f.      |
| 'Lf' means length of field f is too long.                    |
+--------------------------------------------------------------+

          DEFINE('CVTSYM(SYM,TABLE,LENGTH,TYPE)') :(CVTSYM_END)
CVTSYM    SYM = INTEGER(SYM) BASEB(SYM,16)        :S(CVTSYM_1)
          SYM = TABLE<SYM>
          CAUSE = IDENT(SYM,NULL) 'U' TYPE ' '
CVTSYM_1
          SYM = LPAD(SYM,LENGTH,'0')
          CVTSYM = LE(SIZE(SYM),LENGTH) SYM       :S(RETURN)
          CAUSE = CAUSE 'L' TYPE ' '
          SYM =                                   :(CVTSYM_1)
CVTSYM_END


+--------------------------------------------------------------+
| We now go into the pass 2 loop. We tentatively set our       |
| error indicator (CAUSE) to syntax error (S).                 |
+--------------------------------------------------------------+

PASS2     CAUSE = 'S '
          LINE = DISK ' '                         :F(END)
          LINE NO_OP                              :S(PASS2A)
          LINE P.OP.AC.A.X
          OP = CVTSYM(OP,OPS,2,'O')
          AC = CVTSYM(AC,SYMS,1,'R')
          X = CVTSYM(X,SYMS,1,'X')
          A = CVTSYM(A,SYMS,4,'A')
          PUNCH = OP AC X A
          OUTPUT = RPAD(CAUSE,15) OP ' ' AC ' ' X ' ' A
+            ' ' LINE                             :(PASS2)
PASS2A    OUTPUT = DUPL(' ',32) LINE              :(PASS2)
END

Names referenced    Name       Type        Where defined
by ASM:             RPAD       Function    Program 3.3
                    BASEB      Function    Program 2.4

Epilogue 


Note that when an error occurs an instruction is generated in 
any case with one or more fields zeroed. This is so that sym- 
bols that are resolved by the assembler will have their cor- 
rect value and that an assembly with one or two small errors 
may nonetheless be a valid assembly for debug purposes. 
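
The error-tolerant behaviour described here can be illustrated with a rough Python counterpart (not SNOBOL4) of a convert-symbol routine. It is a sketch under our own assumptions: error codes are collected in a list rather than a global string, the symbol table maps names to hex strings, and a missing field is treated as 0. The name cvtsym merely echoes CVTSYM and is not a transcription of it.

```python
def cvtsym(sym, table, length, kind, causes):
    """Convert sym to a hex string of exactly `length` digits.

    An integer operand is converted directly; anything else is looked up
    in `table`.  Problems are appended to `causes` as 'U' + kind
    (undefined symbol) or 'L' + kind (value too long), and the field is
    zeroed so that an instruction is generated in any case.
    """
    if sym == "" or sym.isdigit():
        text = format(int(sym or "0"), "X")
    else:
        text = table.get(sym, "")
        if text == "":
            causes.append("U" + kind)
    text = text.rjust(length, "0")          # left-pad with zeros
    if len(text) > length:
        causes.append("L" + kind)
        text = "0" * length
    return text


if __name__ == "__main__":
    causes = []
    symbols = {"ALPHA": "4"}
    print(cvtsym("ALPHA", symbols, 4, "A", causes),
          cvtsym("GAMMA", symbols, 4, "A", causes),
          cvtsym("300", symbols, 1, "R", causes),
          causes)
```

Because every call returns a field of the correct width, the location counter stays in step with the source even when a line is in error.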


The assembler is a very primitive one lacking many 'bells and
whistles' of a commercial product. Extensions such as data
generation statements, expressions, relocatability,
pseudo-ops, conditional assembly and multiple-location counters
can be added, however, without a major overhaul of the program
structure. For a more detailed discussion of assembler
implementation, see Donovan [1972].


Compiling using SNOBOL4


There has been much written on the subject of compilation
and parsing in the past several years. Much of this
writing is theoretical and most is devoted to a
thorough analysis of parsing; i.e., the decomposition
of an input into its linguistic components. For
example, the recognition that the source language string:


A = BETA + C * DELTA 
is of the form: 
VARIABLE = EXPRESSION 


and that EXPRESSION is of the form TERM1 + TERM2 and that 
TERM2 is of the form FACTOR * FACTOR, may be regarded as 
parsing the original string. Parsing is an essential component 
in the translation not only of computer languages but of 
natural languages as well. 


It has long been recognized, however, that parsing comprises 
only a portion of the compilation process and not the dominant 
portion by any means. This is especially true in SNOBOL4 where 
pattern matching makes parsing quite automatic, as we will 
see. On the other hand, techniques for generating efficient 
object code from a fully parsed statement are not well
understood and are often embedded in compiler listings and
nowhere else. Some of these methods have been distilled into
English and can be found in Gries [1971], Donovan [1972],
Graham [1975] and McClure [1972].


We have introduced in a previous chapter the BNF (Backus
Normal Form) for representing sets of strings or languages. As
an example, the grammar shown in Figure 18.2 can be used to
define a simple language which we will refer to as L1. L1
contains only assignment statements, the four fundamental
(binary) arithmetic operations, and negation. Identifiers


<IDEN>    ::= <LETTER> | <IDEN><LETTER> | <IDEN><DIGIT>
<INTEGER> ::= <DIGIT> | <INTEGER><DIGIT>
<PRIMARY> ::= <IDEN> | <INTEGER> | (<E>)
<FACTOR>  ::= <PRIMARY> | -<PRIMARY>
<TERM>    ::= <TERM>*<FACTOR> | <TERM>/<FACTOR> | <FACTOR>
<E>       ::= <E>+<TERM> | <E>-<TERM> | <TERM>
<STMT>    ::= <IDEN>=<E>



Figure 18.2 


A BNF description for the language L1.

We will assume that the reader is already acquainted with BNF.
He has undoubtedly been exposed to this or similar notation
when learning the constructs accepted by a programming
language or indeed any other linguistic system such as an
operating system command language or an editor's command
language. This notation can be directly mapped into SNOBOL4
patterns so that any syntactic variable is associated with
some pattern. In fact Exercise 18.9 invites you to write a
program to carry out this translation automatically.
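
For readers who would like to see the grammar of Figure 18.2 at work before the SNOBOL4 treatment, here is a Python sketch (not SNOBOL4) of a recursive-descent parser with one method per syntactic variable. It is not the pattern-based method of this chapter: the left-recursive rules for <E> and <TERM> are replaced by loops, which is a standard substitution, and the tuple form of the parse tree is our own choice.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z][A-Za-z0-9]*|[0-9]+|[-+*/()=])")


def tokenize(text):
    """Split a statement into identifiers, integers and operators."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError("bad character at position %d" % pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError("expected %r, found %r" % (expected, tok))
        self.pos += 1
        return tok

    def stmt(self):                         # <STMT> ::= <IDEN>=<E>
        name = self.take()
        if not name[0].isalpha():
            raise SyntaxError("identifier expected, found %r" % name)
        self.take("=")
        tree = ("=", name, self.expr())
        if self.peek() is not None:
            raise SyntaxError("unexpected %r" % self.peek())
        return tree

    def expr(self):                         # <E>: terms joined by + or -
        tree = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            tree = (op, tree, self.term())
        return tree

    def term(self):                         # <TERM>: factors joined by * or /
        tree = self.factor()
        while self.peek() in ("*", "/"):
            op = self.take()
            tree = (op, tree, self.factor())
        return tree

    def factor(self):                       # <FACTOR> ::= <PRIMARY> | -<PRIMARY>
        if self.peek() == "-":
            self.take()
            return ("neg", self.primary())
        return self.primary()

    def primary(self):                      # <IDEN> | <INTEGER> | (<E>)
        tok = self.take()
        if tok == "(":
            tree = self.expr()
            self.take(")")
            return tree
        if tok[0].isalnum():
            return tok
        raise SyntaxError("unexpected %r" % tok)


def parse(text):
    return Parser(tokenize(text)).stmt()


if __name__ == "__main__":
    print(parse("A = BETA + C * DELTA"))
```

The nesting of the resulting tuples reflects the precedence built into the grammar: the product C * DELTA is grouped before the sum is formed.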


One difficulty with a BNF description is that languages that 
it is used to describe are typically not context free. Thus 


A(3) = 17 


may or may not be valid in Fortran depending on declarations
for A. Pure BNF cannot be used to decide the issue. Such
context dependencies are generally treated by the addition of
a symbol table, with appropriate insertions and checks; in
this way the language can be treated as context free, even
though it is in fact not. Dynamic function evaluation can be
used in SNOBOL4 to make these checks. Thus, for example, if
the function ATEST(X) will test to see if its argument is an
array and if ID is a pattern to match identifiers, then


ID $ X  *ATEST(X) 


will match only array identifiers. The function ATEST() can 
be written using symbol tables as were needed in ASM. Routines 
such as ATEST() are often erroneously referred to as semantic 
routines. They are not, for their purpose is to extend a con- 
text free formalism to handle context sensitive situations. 
It would be more correct to use the term syntactic routine for 
any routine used to decide syntax. We will reserve the term 
semantic routine for routines which have a side-effect other
than recognition such as code production or error-message
generation.
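
The role of a syntactic routine such as ATEST can be mimicked outside SNOBOL4 as well. The Python sketch below (not SNOBOL4) matches an identifier and then consults a symbol table, succeeding only when the identifier has been declared as an array. The table layout (a dictionary from names to the strings 'array' or 'scalar') and the function name are assumptions made purely for illustration.

```python
import re

IDENT = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def match_array_ident(text, pos, symbols):
    """Match an identifier at pos that is declared as an array.

    Returns (name, new_pos) on success and None on failure, so that the
    symbol-table check takes part in recognition exactly as a purely
    syntactic test would.
    """
    m = IDENT.match(text, pos)
    if m is None:
        return None                         # no identifier here at all
    name = m.group(0)
    if symbols.get(name) != "array":
        return None                         # an identifier, but not an array
    return name, m.end()


if __name__ == "__main__":
    symbols = {"A": "array", "B": "scalar"}
    print(match_array_ident("A(3) = 17", 0, symbols))
    print(match_array_ident("B(3) = 17", 0, symbols))
```

The routine has no side-effect; it only narrows what is recognized, which is the distinction drawn above between syntactic and semantic routines.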


The semantics of a language described using BNF, i.e. the
meaning of the various linguistic constructs, are seldom
defined formally. For the language L1, for example, we may
say that all arithmetic operations represent operations on
integers of a precision equal to that of the target machine.
Most readers, especially those already exposed to Fortran-like
languages, will then understand the meaning of L1. While this
is true of a simple algebraic language it may not be true if
the language is neither algebraic nor simple. Formal systems
to describe semantics are of two kinds, concrete and
theoretical. A concrete system is one which has been subject
to the rigors of machine implementation; a theoretical system
is one which purportedly could be, but which for some reason
has not. Concrete systems (listings) are messy; theoretical
systems are at least buggy and at worst severely distorted.
The answer to this dilemma may lie in the development of
compiler-compilers which compile inefficiently and produce
inefficient code but which yield sufficiently simple listings
that they may be understood. Much of this chapter is dedicated
to the ultimate fulfillment of this pious hope.


Program 18.2 - L_ONE

L_ONE is a compiler for the language L1 (Figure 18.2). The
output is in the form of assembly language (accepted by ASM)
for machine M (Figure 18.1). The implementation of L_ONE is
based on a method of employing semantic routines during a
pattern match, a technique suggested to the author by
M. J. Rochkind (Bell Laboratories, Raritan River, N.J.). This
method is based on the observation that a routine invoked to
generate code (as opposed to one used to supplement the match
as given above in the case of ATEST) is best invoked using
conditional assignment. This defers any code production until
after the match, thus guarding against premature production.
For example, consider the pattern


     P1 . *A() P2 . *B() | P3 . *C()                    (18.1)


If P1 and P2 match, then A() and B() are called. If P1 matches
and P2 fails but P3 matches, then only C() is called. A()
is not called in this case because backup on failure removes
the conditional assignment, as was fully described in Chapter
7. This is, of course, exactly what we want and will greatly
reduce the complexity of a compiler written in SNOBOL4. The
reduction in complexity is worth the price of using
conditional assignment in a way completely unintended by the
originators of the language. Functions called in this way are
supposed to be returning names and receiving values; they do,
but the names are dummy names and the values assigned are
irrelevant.
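The effect of deferring semantic routines until the match has succeeded can be imitated in a few lines of Python. This is a sketch under simplifying assumptions, not a rendering of the SNOBOL4 scanner: literal strings stand in for P1, P2 and P3, and the function names are invented for the illustration. Actions are only recorded while matching; when an alternative fails, its recorded actions are discarded along with it.

```python
def match_sequence(text, pos, parts):
    """Match (literal, action) pairs in order starting at `pos`.

    Returns (end_position, pending_actions), or None if any literal fails.
    """
    pending = []
    for literal, action in parts:
        if not text.startswith(literal, pos):
            return None              # backup: the pending list is discarded
        pos += len(literal)
        pending.append(action)       # remember the action, do not run it yet
    return pos, pending


def match_alternatives(text, alternatives):
    """Try each alternative in turn; return the actions of the first success."""
    for parts in alternatives:
        result = match_sequence(text, 0, parts)
        if result is not None:
            return result[1]
    return None


# P1 . *A() P2 . *B() | P3 . *C(), with literals standing in for P1, P2, P3.
PATTERN = [[("ab", "A"), ("cd", "B")], [("abx", "C")]]
```

For the subject "abx" the first alternative matches "ab" and then fails, so A is dropped and only C is reported; for "abcd" both A and B are reported, in that order.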


It will be more convenient to have only one semantic routine, 
viz. S_(name), where name is the name of a routine. Thus, 
instead of writing 

     P1 . *A()

we will write

     P1 . *S_('A')

But this is a bit messy, so we will write a routine S(name) to
return NULL . *S_(name) so that we may write

     P1 S('A')

to achieve the same effect with a cleaner appearance. The
above pattern (18.1) is then written:

     P1 S('A') P2 S('B') | P3 S('C')


Finally, we can scan and push an element all in the same
pattern by the construction:

     PAT . *PUSH()

where PAT matches the string pushed (see PUSH, Prog. 5.5). The
semantic routines produce code by popping the stack for the 
location of the previous result, producing code to compute a 
new result, and pushing onto the stack the location of the new 
result. 


*  The program L_ONE will compile statements of L1 into
*  assembly language for machine M. In the semantic routines
*  below, there is a label S_op for each operation op.
*
         DEFINE('S(NAME)')
         DEFINE('S_(NAME)T')                       :(S_END)
S        S = EVAL("NULL . *S_('" NAME "')")         :(RETURN)
S_       S_ = .DUMMY                                :($('S_' NAME))

S_NEG    OUTPUT = '   LOADN  ' POP()
         OUTPUT = '   STORE  ' PUSH(TEMP())         :(NRETURN)

S_ADD ;S_SUB ;S_MUL ;S_DIV
         T = POP()
         OUTPUT = '   LOAD   ' POP()
         OUTPUT = '   ' NAME '    ' T
         OUTPUT = '   STORE  ' PUSH(TEMP())         :(NRETURN)

S_ASGN   OUTPUT = '   LOAD   ' POP()
         OUTPUT = '   STORE  ' POP()                :(NRETURN)
S_END
*
*  The following patterns will match the syntactic variables
*  of the language L1 and call the appropriate semantic
*  routines.
*
         LET      = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
         DIGITS   = '0123456789'
         IDEN     = (ANY(LET) (SPAN(LET DIGITS) | '')) . *PUSH()
         INTEGER  = SPAN(DIGITS) . *PUSH()
         PRIMARY  = IDEN | INTEGER | '(' *E ')'
         FACTOR   = PRIMARY | '-' PRIMARY S('NEG')
         TERM     = *TERM '*' FACTOR S('MUL') |
+                   *TERM '/' FACTOR S('DIV') | FACTOR
         E        = *E '+' TERM S('ADD') |
+                   *E '-' TERM S('SUB') | TERM
         STMT     = POS(0) IDEN '=' *E S('ASGN') RPOS(0)
*
*  TEMP() is always ready to provide us with a new temporary
*  location.
*
         DEFINE('TEMP()')                           :(TEMP_END)
TEMP     TEMP_NO = TEMP_NO + 1
         TEMP = 'TEMP' TEMP_NO                      :(RETURN)
TEMP_END
*
*  The main program is essentially a single pattern match.
*
READ     S = TRIM(INPUT)                            :F(END)
REMOVE_BLANKS  S ' ' =                              :S(REMOVE_BLANKS)
         TEMP_NO = 0
         S STMT                                     :S(READ)
         OUTPUT = '*** ERROR IN ' S                 :(READ)
END

Names referenced     Name     Type       Where defined
by L_ONE:            PUSH     Function   Program 5.5
                     POP      Function   Program 5.6


As a simple example, the input 
     A = B - C * D


will produce the output 


     LOAD   C
     MUL    D
     STORE  TEMP1
     LOAD   B
     SUB    TEMP1
     STORE  TEMP2
     LOAD   TEMP2
     STORE  A
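For readers who wish to experiment without a SNOBOL4 system, the stack discipline of the semantic routines can be reproduced in Python. The sketch below is not L_ONE: it uses recursive descent, which never backs up, so code is emitted immediately instead of being deferred through conditional assignment, and it performs no error checking. The function name compile_l1 and the token regular expression are assumptions of the sketch. Leaves are pushed, and each operator pops its operands and pushes the temporary holding its result, just as S_ADD and its companions do.

```python
import re


def compile_l1(stmt):
    """Compile `IDEN = expression` into accumulator-style assembly lines."""
    tokens = re.findall(r"[A-Z][A-Z0-9]*|[0-9]+|[-+*/()=]",
                        stmt.replace(" ", ""))
    pos = 0
    stack, out, temp_no = [], [], 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def temp():
        nonlocal temp_no
        temp_no += 1
        return f"TEMP{temp_no}"

    def emit_binary(op):
        right = stack.pop()                 # location of the right operand
        out.append(f"LOAD {stack.pop()}")   # left operand into the accumulator
        out.append(f"{op} {right}")
        stack.append(temp())
        out.append(f"STORE {stack[-1]}")

    def primary():
        tok = advance()
        if tok == "(":
            expr()
            advance()                       # skip the closing parenthesis
        else:
            stack.append(tok)               # a leaf: push identifier or integer

    def factor():
        if peek() == "-":
            advance()
            primary()
            out.append(f"LOADN {stack.pop()}")
            stack.append(temp())
            out.append(f"STORE {stack[-1]}")
        else:
            primary()

    def term():
        factor()
        while peek() in ("*", "/"):
            op = "MUL" if advance() == "*" else "DIV"
            factor()
            emit_binary(op)

    def expr():
        term()
        while peek() in ("+", "-"):
            op = "ADD" if advance() == "+" else "SUB"
            term()
            emit_binary(op)

    stack.append(advance())                 # the target identifier
    advance()                               # skip "="
    expr()
    out.append(f"LOAD {stack.pop()}")
    out.append(f"STORE {stack.pop()}")
    return out
```

Calling compile_l1("A = B - C * D") yields the same eight instructions as the example above, which makes the redundant STORE TEMP2 / LOAD TEMP2 pair easy to see.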


The resulting code is clearly non-optimal but it gets the job
done. There are numerous extensions that one can incorporate
into L_ONE to produce more efficient code and to provide more
features. Some of these have been left as exercises.


The reader should not be misled by the simplicity with which
L_ONE was written into believing that full-fledged compilers
for complete languages can be had cheaply. In general, the
complexity of a compiler will grow nonlinearly with the
introduction of new features. The world is full of
compiler-compilers that look good for toy languages but which
don't quite stand up to the hammering of a full scale language
such as, for example, PL/I. The mere fact that declarations in
PL/I can follow use is enough to discourage the one-pass
approach used in L_ONE. For big compiling, we must step back a
bit and proceed in stages.


Partitioning the compiler

A compiler is generally decomposed into lexical analysis,
syntactic analysis, code optimization and code generation.
The latter two are often intertwined in more than two passes
for good reasons, as we shall see later. The first two of
these phases are indicated in Figure 18.3.
     (a)   ALPHA = BETA + GAMMA ** 2

     (b)   | ALPHA | = | BETA | + | GAMMA | ** | 2 |

     (c)   =
           +-- ALPHA
           +-- +
               +-- BETA
               +-- **
                   +-- GAMMA
                   +-- 2

                         Figure 18.3

     A lexical analysis (b) and a syntactic analysis
     (c) of an input string (a).


Lexical analysis decomposes the source string into indivisible
tokens (or atoms). These tokens are, of course, not literally
indivisible since they are, after all, composed of characters,
but they are indivisible in the sense that no further
decomposition has any meaning with respect to compilation.
Thus, the meaning of 'ALPHA' is not a composition
(homomorphism) of the meanings of its individual characters
(though its sound may be). On the other hand, the meaning of
'ALPHA + BETA' can be interpreted as a composition of the
meanings of the three tokens 'ALPHA', '+' and 'BETA'. The
distinction is very much like the distinction between morpheme
and phoneme in the study of natural languages. It is actually
a kind of mixed radix system whereby a relatively small number
of different symbols (letters or phonemes) is used to compose
a fairly large (but finite) number of different notions (words
or morphemes). Sentences are then built from the words.
Evidently there are more ideas than sounds.
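A lexical analysis of the kind shown in part (b) of Figure 18.3 can be sketched in Python as follows. The token classes chosen here (identifiers, integers, '**' and single-character operators) are an assumption made for the illustration; a real language would have more.

```python
import re

# One token, preceded by optional blanks. '**' is listed before the
# single-character operators so that it is kept as one token.
TOKEN = re.compile(r"\s*([A-Za-z][A-Za-z0-9]*|[0-9]+|\*\*|[-+*/=()])")


def tokenize(source):
    """Split `source` into a list of indivisible tokens."""
    source = source.rstrip()
    tokens, pos = [], 0
    while pos < len(source):
        m = TOKEN.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens
```

Applied to the string of Figure 18.3 (a), tokenize returns the seven tokens ALPHA, =, BETA, +, GAMMA, ** and 2.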


When SNOBOL4 is used to compile a programming language, no
distinct lexical pass is required. On the other hand, the
input may have to be massaged (pre-processed). In L_ONE this
amounted to removing blanks. In a real language such as
Fortran, blank removal is not nearly so simple as we will see
(BLANKS, Prog. 18.3). In PL/I the pre-processing may consist
of the extraction of the next statement (see PLI.STMT, Prog.
8.10) and the removal of comments. Redundant blank removal is
not nearly so necessary for PL/I as it is for Fortran (since
identifiers cannot be split in PL/I).


The result of a syntactic analysis is the tree structure shown
in Figure 18.3. This tree structure may be represented in any
of a variety of ways, most commonly as a linked structure. In
SNOBOL4 the tree is perhaps best represented as a string in
Polish prefix form (as described in Chapter 9) because pattern
matching may then be exploited to effect desired
transformations.


It is convenient to separate out that portion of a compiler 
which is machine-dependent simply to avoid duplication of ef- 
fort if the same compiler is needed for a different target 
machine. The tree structure of Figure 18.3 is clearly machine 
independent, and code generation is clearly machine-dependent. 
What of code optimization? 


According to McClure [1972], the two most effective means of 
code optimization are common subexpression removal (from ad- 
dress calculations) and register allocation. An example of 
the first is the removal of the common subscript calculation 
in: 


     A(I,J) = A(I,J) + 1


Removal of common subexpressions is machine independent and 
can be effected by transformations applied to the tree struc- 
ture. On the other hand, register allocation is clearly 
machine dependent and must be done at some later stage. 


It is very common to have some intermediate
machine-independent form between the tree structure and the
resulting code. This is to push the machine independence as
far as possible. Hence the intermediate form is a kind of
least common multiple of all machine languages. The original
macro implementation of SNOBOL4 was actually written in such a
language. The most extensive (or perhaps intensive would be a
better word) of this kind known to the author is being
developed by Robert Dewar (Ill. Inst. of Tech., Chi., Ill.) in
connection with a machine-independent implementation of
SPITBOL. Dewar's motivation is to produce a macro language
which will lose little in efficiency when expanded on a given
machine.


One of the more common intermediate forms is the four-tuple.
Four-tuples consist of an operation followed by two operands
followed by a destination all separated from each other by a
convenient break character such as a comma. For example:

     ADD,L1,L2,L3
would mean add the contents of L1 and L2 and store the result
into L3. We will assume that the locations can be indexed by
other locations. For example:

     MUL,A(TEMP2),TEMP3,TEMP4

would reference as the first argument the location A offset by
the current value of TEMP2. This could be rendered in machine
M code as:

     LOAD   1,TEMP2
     LOAD   A(1)
     MUL    TEMP3
     STORE  TEMP4


An optimized version of this code may not actually contain the 
initial LOAD or the STORE. This will depend on the origin of 
TEMP2 and the destination of TEMP4. 
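The unoptimized expansion just described is mechanical, as the following Python sketch suggests. It covers only binary arithmetic tuples, assumes (as in the example) that index register 1 is free and that the destination is not indexed, and uses function names invented for the illustration.

```python
import re

# An indexed operand has the form NAME(INDEX).
INDEXED = re.compile(r"(\w+)\((\w+)\)$")


def address(arg, out):
    """Return an operand field for `arg`, loading index register 1 if needed."""
    m = INDEXED.match(arg)
    if m is None:
        return arg
    base, index = m.groups()
    out.append(f"LOAD 1,{index}")     # index register 1 receives the offset
    return f"{base}(1)"


def expand(four_tuple):
    """Expand a binary tuple 'OP,ARG1,ARG2,DEST' without any optimization."""
    op, arg1, arg2, dest = four_tuple.split(",")
    out = []
    first = address(arg1, out)
    out.append(f"LOAD {first}")
    second = address(arg2, out)
    out.append(f"{op} {second}")
    out.append(f"STORE {dest}")
    return out
```

For the tuple MUL,A(TEMP2),TEMP3,TEMP4 this produces the four instructions shown above. Deciding when the first LOAD or the final STORE may be omitted requires knowledge of neighboring tuples, which this sketch does not have.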


Hence we may decompose a large processor into the following 
phases (as opposed to passes since several phases may actually 
go on in the same pass). 


     1.  Pre-processing
     2.  Syntactic analysis
     3.  Tree transformations and global optimization
     4.  Intermediate language production
     5.  Final expansion and detailed optimization


Program 18.3 - BLANKS

The function BLANKS is an example of pre-processing that may
be required when compiling a full language. BLANKS(S) will
remove blanks from a Fortran statement provided as argument.
We assume a function such as FORTREAD (Prog. 9.2) is available
to read in a statement and handle continuation. Removing
blanks sounds simple but is complicated by the fact that
blanks within string literals may not be removed. A string
literal in ANSI Fortran has the form

     nH<n-characters>     (e.g. 3HCAT)


String literals may only appear in FORMAT and CALL statements. 
But we cannot simply go looking for this pattern in such 
statements because the indicated pattern may appear as part of 
an identifier (which may also be an argument of a subroutine 
call). For example: 


CALL ALPHA (A1H) 
contains no literal. Hence we must ignore such sequences which 
follow alphabetics. Another problem is that blanks may be in- 
terspersed in and around the length indicator. For example: 


1 2 HABCDEFGHIJKL 


is a valid literal. This makes it difficult (but, as we will 
see, not impossible) to write a single pattern to match a 
literal. 
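The two complications just mentioned (a count that follows an alphabetic is part of an identifier, and a count may itself contain blanks) can be seen in the following Python sketch. It handles only the ANSI nH form and is a character-by-character scan, not the pattern-matching approach that BLANKS takes; the function name is an assumption of the sketch.

```python
def remove_blanks(stmt):
    """Delete blanks from `stmt` except inside nH<characters> literals."""
    out = []
    i = 0
    while i < len(stmt):
        ch = stmt[i]
        if ch == " ":
            i += 1
            continue
        # A digit that follows a letter or digit belongs to an identifier.
        follows_name = bool(out) and out[-1][-1].isalnum()
        if ch.isdigit() and not follows_name:
            # Collect a count whose digits may be separated by blanks.
            j, digits = i, ""
            while j < len(stmt) and (stmt[j].isdigit() or stmt[j] == " "):
                if stmt[j] != " ":
                    digits += stmt[j]
                j += 1
            if j < len(stmt) and stmt[j] == "H":
                n = int(digits)
                # Copy the n literal characters untouched, blanks included.
                out.append(digits + "H" + stmt[j + 1:j + 1 + n])
                i = j + 1 + n
                continue
            out.append(digits)
            i = j
            continue
        out.append(ch)
        i += 1
    return "".join(out)
```

Thus remove_blanks("CALL ALPHA (A1H)") gives CALLALPHA(A1H), with no literal recognized, while in "CALL OUT (3HA B, X 1)" the blank inside 3HA B survives and the one inside X 1 does not.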


If we depart from the relatively rarefied air of the ANSI
standard and enter the domain of a practical compiler, we
encounter more problems. IBM's OS/360 Fortran [IBM 360j] is
typical of many Fortrans and so we will assume this to be our 
source language. With respect to blank removal, this Fortran 
has the following additional properties: 


(1) A literal may be designated by the sequence '...' as 
well as by the nH<n-characters> sequence.


(2) Function calls (as well as subroutine calls) may con- 
tain literals. 


(3) The READ and WRITE statements may be direct access in 
which case they have the form: 


     cmnd(f ' exp ...


where cmnd is READ or WRITE, where f is an integer or 
an identifier designating a file and where exp is an 
arbitrary expression designating a record number. 


Now (2) implies that all arithmetic expressions (including the 
exp portion of (3)) can potentially contain literals. 
Therefore READ and WRITE statements must be handled specially. 
A logical IF statement has the form: 


IF( exp ) stmt 


Here we must check to see if stmt is a READ or WRITE statement 
but our check is complicated by the fact that in order to find 
stmt we must determine where exp ends. To do this we must 
maintain a parenthesis count ignoring parentheses that are 
within literals. This can be done by recursion in a manner 
reminiscent of the BAL function (Prog. 8.3). 


We might say a word at this point as to why we wish to go 
through so much trouble to remove blanks. For one thing, the 
blank removal process can be used not only for compiling but 
for many other kinds of pre-processing, data laundry, etc. 
that require pattern matching of Fortran programs. Hence it 
saves duplication of effort if it can be done once and for 
all. Another reason is that keywords, identifiers and many
other non-decomposable units can have blanks interspersed
within them (however improbable that may be) which will prove
difficult to pattern match. For example, the keyword READ may
be written as 'R EA D'; to match this we may write:

     OPTB = SPAN(' ') | NULL
     READ = 'R' OPTB 'E' OPTB 'A' OPTB 'D'

but this is as troublesome as it is inefficient.
*  BLANKS(S) will return the result of removing blanks from a
*  Fortran statement provided in S. BLANKS(S) will operate
*  correctly for OS/360 Fortran [IBM 360g]. The statement is
*  presumed to have had its label removed by previous
*  processing.
*
         DEFINE('BLANKS(S)IF,KW,STMT,IO')
         Q = "'"
         ALPHA = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
         NUM = '0123456789'
*
*  FBAL will match a string balanced with respect to
*  parentheses but will ignore parentheses within literals. We
*  will use backup-free scanning (i.e. the ARBNO(P FENCE)
*  construct) as established in Chapter 6.
*
         BLINT = ANY(NUM) (SPAN(NUM ' ') | NULL)
         F.LIT = BLINT $ N 'H' LEN(*DIFF(N,' ')) . LIT
+                | Q BREAK(Q) . LIT Q
         ITEM1 = F.LIT | SPAN(' ') | SPAN(ALPHA NUM ' ')
+                | LEN(1)
         SEARCH.LIT = POS(0) ARBNO(ITEM1 FENCE) . TEMP F.LIT
         ITEM2 = '(' *FBAL ')' | ITEM1
         FBAL = ARBNO(ITEM2 FENCE)
*
*  The function BL(S) will remove all blanks from S except
*  those in literals.
*
         DEFINE('BL(S)LIT,TEMP')                    :(BL_END)
BL       S SEARCH.LIT =                             :F(BL_1)
         BL = BL DIFF(TEMP,' ') "'" LIT "'"          :(BL)
BL_1     BL = BL DIFF(S,' ')                         :(RETURN)
BL_END
*
*  Define some patterns to scan statements containing
*  critical keywords.
*
         KWORD.KW = POS(0) SPAN(ALPHA ' (') . KW
         IF.STMT = POS(0) ('IF(' FBAL ')') . IF REM . STMT
         IO.STMT = POS(0) (('READ' | 'WRITE') '('
+          BREAK(ALPHA NUM) SPAN(ALPHA NUM ' ')) . IO Q REM . STMT
+                                                   :(BLANKS_END)
*
*  Entry point for BLANKS(S). First remove blanks from the
*  keyword to test statement type.
*
BLANKS   S KWORD.KW = DIFF(KW,' ')
         BLANKS = S
         BLANKS IF.STMT = BL(IF) BLANKS(STMT)        :S(RETURN)
         BLANKS IO.STMT = DIFF(IO,' ') "'" BL(STMT)  :S(RETURN)
         BLANKS = BL(S)                              :(RETURN)
BLANKS_END

Names referenced     Name     Type       Where defined
by BLANKS:           DIFF     Function   Program 3.10
Program 18.4 - POL

The method of invoking semantic routines used in the coding of
L_ONE is general enough but not sufficiently convenient for
very large languages of, say, PL/I size. To see this, consider
the tree decomposition of a language statement as shown in
Figure 18.3. By means of S() a function may be called before
and after each node of the tree with the sequence of calls
being made in left-to-right order. Moreover, every leaf of the
tree may be pushed and these pushes are interspersed between
calls also in a left-to-right fashion. We could hardly ask for
anything better, or could we?


The reader will find, if he does the exercises involving ex- 
tensions to L_ONE, that he will be forced to push and pop many 
different items in order to preserve quantities from the start 
of a syntactic unit across to its termination. For example, 
to produce code for IF<E>THEN<S> we must create a conditional 
branch across the THEN-clause. For this we will need to create 
a label which will be used in two places, before and after the 
<S>. Since <S> may be arbitrary including another IF<E>THEN<S> 
sequence the label cannot be assigned to a variable but must 
be pushed and popped. Now if the functional relationship
followed the structural relationship we would regard IFTHEN as
a single node of a tree with two arguments <E> and <S>. The
IFTHEN function would call the functions for <E> and <S> to 
obtain translations. This will prove to be more natural. The 
temporary-variable facility built into the function mechanism 
can be used instead of stacks and a somewhat cleaner implemen- 
tation results. In order to achieve a functional relationship 
conforming to the structural relationship the source string is 
converted into a tree form; our tree will be Polish prefix. 


To obtain a slightly richer language to illustrate the
conversion process, we define an upward compatible superset of
L1, called L2. This is defined in Figure 18.4. Unlike the case
for L1, we must allow blanks as separators (not shown in the
BNF) but we do not permit blanks within identifiers and
numbers. This is much like the PL/I convention whereas L1
followed the Fortran convention.


The form of Polish prefix for any non-leaf (a node containing
at least one descendant) is:

     operator:n,operand1,operand2,...,operandn

where each operand is itself a valid tree. The operator may
not contain either of the two special characters colon or
comma. For a leaf, the :n is absent and, of course, there are
no operands. Thus:

     A + B * C     becomes     +:2,A,*:2,B,C
and
     A * (-B)      becomes     *:2,A,-:1,B
<ELIST>::=<E>,<ELIST> | <E>
<REF>::=<IDEN>(<ELIST>)
<PRIMARY>::=<IDEN> | <INTEGER> | (<E>) | <REF>
<FACTOR>::=<PRIMARY> | -<PRIMARY>
<TERM>::=<TERM>*<FACTOR> | <TERM>/<FACTOR> | <FACTOR>
<E>::=<E>+<TERM> | <E>-<TERM> | <TERM>
<RELOP> is one of '>' '<' '<=' '>=' '=' '~='
<BOOL>::=<E><RELOP><E>
<IFSTMT>::=IF<BOOL>THEN<STMT>ELSE<STMT> | IF<BOOL>THEN<STMT>
<VAR>::=<IDEN> | <REF>
<ASGNSTMT>::=<VAR>=<E>
<STMT>::=<IFSTMT> | <ASGNSTMT>

                         Figure 18.4

     The language L2. The definitions for <IDEN> and
     <INTEGER> are the same as for L1 (Figure 18.2).


This seems ugly but it will be easy to produce, scan and 
expand. 


A functional form such as A(B,C,D) will translate into: 
     REF:2,A,COMMA:2,B,COMMA:2,C,D


No distinction is made, at least initially, between an array 
and a function since declarations may follow first use. Note 
that the argument list is a sequence of 2-ary functions rather 
than a single n-ary. This form is easier to produce and just 
as easy to scan. 


To transform infix to prefix, we will use the conditional in- 
vocation of semantic routines as in L_ONE. Only two routines 
need be defined; CPUSH(STR) will conditionally push the string 
STR onto the stack (conditional upon the pattern being a part 
of an overall successful match). CPUSH(STR) actually returns: 


NULL . *S_('CPUSH', STR) 


where S_() is now written expecting an extra argument. The
other routine is POL(N) which causes N+1 items on the stack to
be popped and replaced by one larger item, viz.

     OP:N,ARG1,ARG2,...,ARGn

The operator is assumed to be the second-to-last item on the
stack. N is at least 1.
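The stack reduction performed by POL(N) is easily tried out in Python. In this sketch an ordinary list serves as the stack and the function name pol is an assumption; the conditional (deferred) invocation is not modeled, only the reduction itself.

```python
def pol(stack, n):
    """Replace the top n+1 stack items by the single item 'OP:n,ARG1,...,ARGn'.

    The operator is taken to be the second-to-last item pushed.
    """
    last = stack.pop()                          # ARGn
    op = stack.pop()                            # the operator
    args = [stack.pop() for _ in range(n - 1)]  # ARGn-1 down to ARG1
    args.reverse()
    stack.append(f"{op}:{n}," + ",".join(args + [last]))
```

For A + B * C the items A, +, B, *, C are pushed in that order; pol(stack, 2) first reduces B, *, C to *:2,B,C and a second pol(stack, 2) leaves the single item +:2,A,*:2,B,C.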


Once the machinery of POL(N) and CPUSH(STR) has been set up,
very large languages can be compiled with no additional
semantic routines except error messages and routines to handle
declarations. These we ignore for simplicity. We will
illustrate the method by writing a pattern which will transform
sentences of L2 into Polish prefix.


*  This program illustrates how to convert L2 into Polish
*  prefix using special semantic routines, viz. POL(N) and
*  CPUSH(S) for the purpose. We first define the semantic
*  routines.
*
         DEXP("POL(N) = S('POL',N)")
         DEXP("CPUSH(ARG) = S('CPUSH',ARG)")
         DEFINE('S(NAME,ARG)')
         DEFINE('S_(NAME,ARG)T1,T2')                     :(S_END)
S        S = EVAL("NULL . *S_('" NAME "','" ARG "')")     :(RETURN)
S_       S_ = .DUMMY                                      :($('S_' NAME))
S_POL    T2 = POP()
         T1 = POP() ':' ARG ','
S_POL1   (EQ(ARG,1) PUSH(T1 T2))                          :S(NRETURN)
         ARG = ARG - 1
         T2 = POP() ',' T2                                :(S_POL1)
S_CPUSH  PUSH(ARG)                                        :(NRETURN)
S_END
*
*  We now write our patterns. Interspersed blanks are handled
*  by placing an optional blank pattern at the end of each
*  pattern primitive. Patterns formed from other patterns
*  then need not worry about blanks.
*
         AL       = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
         NU       = '0123456789'
         BL       = SPAN(' ') | NULL
         IDEN     = (ANY(AL) (SPAN(AL NU) | '')) . *PUSH() BL
         INTEGER  = SPAN('0123456789') . *PUSH() BL
         ADDOP    = ANY('+-') . *PUSH() BL
         MULOP    = ANY('*/') . *PUSH() BL
         RELOP    = (ANY('=<>') | ANY('~><') '=') . *PUSH() BL
         LP       = '(' BL
         RP       = ')' BL
         ELIST    = *E (',' BL CPUSH('COMMA') *ELIST POL(2) | '')
         REF      = IDEN LP CPUSH('REF') ELIST RP POL(2)
         PRIMARY  = IDEN | INTEGER | LP *E RP | REF
         FACTOR   = PRIMARY | '-' . *PUSH() BL PRIMARY POL(1)
         TERM     = *TERM MULOP FACTOR POL(2) | FACTOR
         E        = *E ADDOP TERM POL(2) | TERM
         BOOL     = *E RELOP *E POL(2)
         IFSTMT   = 'IF' BL BOOL 'THEN' BL
+          (*STMT 'ELSE' BL CPUSH('IFELSE') *STMT POL(3) |
+          CPUSH('IFTHEN') *STMT POL(2))
         ASGNSTMT = (IDEN | REF) '=' . *PUSH() BL *E POL(2)
         STMT     = IFSTMT | ASGNSTMT

Names referenced     Name     Type       Where defined
by POL:              DEXP *   Function   Program 14.1
                     PUSH     Function   Program 5.5
                     POP      Function   Program 5.6

* indicates name is referenced in the initialization section.


Epilogue 
For example, if we execute:

     'IF A(I) > 6 THEN I = 2' STMT
     OUTPUT = POP()

we will print:

     IFTHEN:2,>:2,REF:2,A,I,6,=:2,I,2


Program 18.5 - TREE

With a statement cast as Polish prefix we may enter the
optional tree-adjustment phase in which the tree is scanned
looking for patterns which may be pruned, modified or
rearranged. There are several reasons for doing this, some of
which are listed below:


     1.  To insert explicit conversions (for mixed mode
         arithmetic, array references, etc.).

     2.  To remove ambiguities (such as floating versus integer
         addition, binary versus unary minus, function
         references versus array references).

     3.  Code optimization such as common subexpression removal
         or such as replacing <VAR> = <VAR> + 1 by a single
         operator.


Other uses for the tree adjustment phase will occur to the 
writer of a practical compiler. An important point to note is 
that the scan is generally easier to apply to the tree than to 
any other form because it is quite easy to specify a pattern 
to match a tree. The following function, TREE(P,N), will 
return a pattern that will do precisely that. For example, 


TREE('+',2) $ OUTPUT FAIL 


is a pattern that will scan for and print all binary sums in 
Polish prefix form. 
*  TREE(P,N) will match a tree in Polish prefix form whose
*  node value matches the pattern P and where N is the number
*  of branches. The tree is assumed to be a non-leaf. If N
*  is 0, then an arbitrary number of nodes (up to some
*  maximum) is implied.
*
         DEFINE('TREE(P,N)')
         ARB_TREE = TREE(BREAK(':,')) | BREAK(':,') ','
+                                                   :(TREE_END)
TREE     TREE = EQ(N,0) P
+          (TREE(,1) | TREE(,2) | TREE(,3) | TREE(,4))
+                                                   :S(RETURN)
         TREE = P ':' N ','
TREE_1   N = N - 1 GT(N,0)                          :F(RETURN)
         TREE = TREE *ARB_TREE                      :(TREE_1)
TREE_END

Epilogue 


The alert reader will note that the pattern requires a
terminating ','. Thus, to use TREE on the Polish notation
described above would require appending a comma to the total
string. It may also be necessary to prepend a comma. For
example, ARB_TREE is a variable which was set, as a side-effect
of initializing TREE, to equal a pattern which will match an
arbitrary tree. Then:

     POLISH = ',' POLISH ','
     POLISH ',' ARB_TREE $ T ARB *T


will scan the Polish for a pair of identical expressions. (For 
this pattern match to work it will be necessary to use 
FULLSCAN mode; in QUICKSCAN mode, ARB indicates futility as 
was discussed in Chapter 7). Several examples of the use of 
TREE have been left as exercises. 
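The reason a tree in Polish prefix is easy to scan is that each node announces how many subtrees follow it. The Python sketch below makes this explicit; it works on the comma-separated items directly instead of building a pattern as TREE does, and both function names are assumptions of the sketch.

```python
def subtree_end(items, start):
    """Return the index just past the subtree that begins at items[start]."""
    head = items[start]
    if ":" not in head:
        return start + 1                       # a leaf occupies one item
    count = int(head.rsplit(":", 1)[1])        # number of branches
    pos = start + 1
    for _ in range(count):
        pos = subtree_end(items, pos)          # skip one operand tree
    return pos


def find_trees(polish, operator, branches):
    """Return every subtree whose node is `operator` with `branches` branches."""
    items = polish.split(",")
    wanted = f"{operator}:{branches}"
    return [",".join(items[i:subtree_end(items, i)])
            for i, item in enumerate(items) if item == wanted]
```

For example, find_trees("=:2,X,+:2,A,+:2,B,C", "+", 2) returns the two binary sums +:2,A,+:2,B,C and +:2,B,C, which is the list that TREE('+',2) $ OUTPUT FAIL would print for the same string.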


Program 18.6 - TR

Given a statement in Polish prefix, we can generally produce
compiled code by recursive invocation of a single translate
function. We will not produce code directly but will create
four-tuples as described previously. The set of acceptable
4-tuples is indicated in Figure 18.5.


Certain semantic ambiguities in the description of L2 need be
resolved before TR can be written. Floating point as well as
integer arithmetic will be permitted. We assume that
identifiers beginning with ANY('IJKLMN') are integer; all
others are real (floating point). Mixed-mode arithmetic is not
permitted. The functional forms specified in the syntax of L2
refer to array references; function calls are not permitted
(but are left as an exercise). Finally, for simplicity, array
references are assumed to be one-dimensional. The extension
to multi-dimensioned arrays is relatively straightforward


Page 426 Chapter 18 - Assemblers, Compilers and Macros 


     4-tuple                 Description

     ADD,arg1,arg2,arg3      Place arg1 plus arg2 into arg3.
                             Seven similar operations for SUB,
                             MUL, DIV, FADD, FSUB, FMUL and FDIV.

     ASGN,arg1,,arg3         Move the quantity from arg1 to arg3.

     MNS,arg1,,arg3          Store -arg1 into arg3.

     BR,,,arg3               Branch to arg3.

     BRGT,arg1,arg2,arg3     Branch to arg3 if arg1 is greater
                             than arg2. Five similar operations
                             for BRGE, BREQ, BRNE, BRLT and BRLE.

     LBL,arg1                Insert a label here.

     argn is of the form ID or ID(ID) where ID is
     an identifier.

     If identifiers are of the form TEMPn they are
     considered volatile; i.e., they may be destroyed
     after first use.

                         Figure 18.5
                     The tuple language.
given the standard multiplier technique [Gries 1971, Sect.
8.4] but is beyond the scope of the present discussion. 
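Under the conventions just stated, the recursive translation of Polish prefix into 4-tuples can be sketched in Python. The sketch covers only assignment, the four arithmetic operators and unary minus; array references, relations and IF statements are omitted. The names translate, tr and first_leaf are assumptions of the sketch, and the tuples are returned as a list instead of a single string. As in L_ONE, each subtree leaves the location of its result on a stack.

```python
NAMES = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}


def translate(polish):
    """Translate Polish prefix (assignment and arithmetic only) to 4-tuples."""
    items = polish.split(",")
    tuples, stack = [], []
    temp_no = 0

    def temp():
        nonlocal temp_no
        temp_no += 1
        return f"TEMP{temp_no}"

    def first_leaf(pos):
        while ":" in items[pos]:
            pos += 1
        return items[pos]

    def tr(pos):
        """Translate the subtree at items[pos]; return the following index."""
        head = items[pos]
        if ":" not in head:
            stack.append(head)                    # a leaf: just push it
            return pos + 1
        op, n = head.split(":")
        if op == "=":
            pos = tr(tr(pos + 1))                 # target, then expression
            source, target = stack.pop(), stack.pop()
            tuples.append(f"ASGN,{source},,{target}")
            return pos
        # Integer if the first leaf starts with I..N, otherwise real.
        prefix = "" if first_leaf(pos)[0] in "IJKLMN" else "F"
        if n == "1":
            pos = tr(pos + 1)
            result = temp()
            tuples.append(f"MNS,{stack.pop()},,{result}")
        else:
            pos = tr(tr(pos + 1))                 # left operand, then right
            right, left = stack.pop(), stack.pop()
            result = temp()
            tuples.append(f"{prefix}{NAMES[op]},{left},{right},{result}")
        stack.append(result)
        return pos

    tr(0)
    return tuples
```

For X = X + 1, whose Polish form is =:2,X,+:2,X,1, the result is FADD,X,1,TEMP1 followed by ASGN,TEMP1,,X. The operation is floating because the first leaf of the sum, X, does not begin with one of the letters I through N.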


TR() will return a translation of a polish string con- 
tained in the global variable POLISH which is modified 
(and reduced to null) in the process. A trailing comma is 
appended to the Polish string to permit easier pattern 
matching. The translation is in the form of 4-tuples 
separated by '//'. The language is Log. 

in nan nie moi rl in ST i nil iii i a msi ea inacsntiaaamiaaa 


DEFINE ('TR (ARG) OP, N, P, T, ID,L1,1L2') 


Oe ee ne ee ee ee ee ee ee 
{ Pattern definitions: ITREE will match an integer tree. | 
{ RTREE will match a real tree. ] 
eta asics enlist ieee bei sei aaiilaemasianiical 


ITREE = ANY('4+-*/') 's* ANY('12') ',° *ITREE { 
+ ANY (‘IOKLMN') BREAK(',:! 1,8 | "REF:2,' *ITREE 

RTREE = ANY('+-*/') 's* ANY('12") ©, *RTREE | 
+ NOTANY (*IJKLMN') BREAK(',:') *,' | *REF:2,' *RTREE 

: (TR_END) 

Sr ee ee ge ee ee ee EE ne ere See ee eer eee ee ae 
| Entry point: if an operator, fan out; otherwise push the | 
{ leaf. { 
Nc cn dere ey ee ide ices atic eeu es sp a acta bint il eon reel 
TR POLISH POS(0) BREAK(':,") . OP ':' BREAK(',') . N 
+ ',t = 2S($('TR_' OP)) 

POLISH BREAK(‘,') .~ *PUSH() ',* = 2 (RETURN) 


rn ne 
{ Arithmetic operators. | 
i nik seein cop ud Shaan sma epi cimetidine cause 
TR_+ sTR~ 3TR_* 3TR_/ 

TR = EQ(N,1) TR() 'MNS,* POP() *,,' PUSH(TEMP()) '//* 


é :S (RETURN) 
'+ADD-SUB*MUL/DIV' OP LEN(3) . OP 
POLISH POS(0) ITREE :S(TR_1) 
OP = ‘Ft op 
TR_1 T = TR(Y) 
P = POP() 
TR= T TR() OP ',* P *,* POP() ',* PUSH(TEMP()) '//* 
+ : (RETURN) 


CL LE 
{ -Array references { 
a ea ee ernpserrerbeesieir ements moms encase en mii a en 


TR_REF POLISH BREAK(',*) . ID ',* = 


TR = TR() 
TOP() '¢! 2S (TR_REF1) 
PUSH(ID '(* POP() *) *) : (RETURN) 
TR_REF1 T = TEMP() 
| TR = TR ‘'ASGN,' POP() ',,' T ‘'//! 
PUSH(ID *(' T ty ?¢y : (RETURN) 


+------------------------------------------------------------+
| Relations are handled here. Note that '=' has been         |
| translated by the TR_IF... processor to 'EQ' to avoid      |
| ambiguity with assignment. The argument, ARG, contains     |
| the fail label. Success implies a no-op. Hence we need     |
| the complement of the given operation.                     |
+------------------------------------------------------------+
TR_>     ;TR_>=    ;TR_<     ;TR_<=    ;TR_~=    ;TR_EQ
         'EQNE ~=EQ <GE >LE <=GT >=LT' OP LEN(2) . OP
         T  =  TR()
         P  =  POP()
         TR  =  T TR() 'BR' OP ',' P ',' POP() ',' ARG '//'
+                                                    :(RETURN)
+------------------------------------------------------------+
| Assignment                                                 |
+------------------------------------------------------------+
TR_=     TR  =  TR() TR() 'ASGN,' POP() ',,' POP() '//'
+                                                    :(RETURN)
+------------------------------------------------------------+
| The IF's                                                   |
+------------------------------------------------------------+
TR_IFTHEN
TR_IFELSE L1  =  LABEL()
         POLISH POS(0) '=:2'  =  'EQ:2'
         TR  =  TR(L1) TR()
         TR  =  EQ(N,2) TR 'LBL,' L1 '//'            :S(RETURN)
         L2  =  LABEL()
         TR  =  TR 'BR,,,' L2 '//'
+           'LBL,' L1 '//' TR() 'LBL,' L2 '//'        :(RETURN)
TR_END


+------------------------------------------------------------+
| LABEL() is like TEMP().                                    |
+------------------------------------------------------------+

         DEFINE('LABEL()')                           :(LABEL_END)
LABEL    LABEL_NO  =  LABEL_NO + 1
         LABEL  =  'LBL.' LABEL_NO                   :(RETURN)
LABEL_END
Names referenced     Name     Type          Where defined
by TR:               PUSH     Function      Program 5.5
                     POP      Function      Program 5.6
                     TOP      Function      Program 5.7
                     TEMP     Subfunction   Program 18.2
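
The relation handling in TR selects the complement of an operator by
searching a single string for the operator and taking the two
characters that follow it. The lookup can be sketched as follows.
This is an illustration in Python rather than SNOBOL4; the table
string is a reading of the listing above, and its exact spacing, like
the names COMPLEMENTS and complement, is an assumption.

```python
# Each relation is followed immediately by the two-letter name of its
# complement.  The order matters: '<' must be found before '<=' and
# '>' before '>=', so the one-character operators come first.
COMPLEMENTS = "EQNE ~=EQ <GE >LE <=GT >=LT"


def complement(op):
    """Return the two-letter complement of relation `op` (e.g. '>' -> 'LE')."""
    i = COMPLEMENTS.find(op)
    if i < 0:
        raise ValueError("unknown relation: " + repr(op))
    j = i + len(op)                  # skip over the operator itself
    return COMPLEMENTS[j:j + 2]      # the next two characters name the complement


def branch_op(op):
    """Name of the branch tuple taken when the relation is false."""
    return "BR" + complement(op)
```

For the statement IF X > Y THEN ..., branch_op('>') gives the BRLE
operation that appears in the tuples of Figure 18.6.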
18.7   Program TUPLE(OP,ARG1,ARG2,ARG3) will expand a 4-tuple
(as described in Figure 18.5) into reasonably optimized
machine code. It does this by being 'aware' at all times of
the state of the registers, and it allocates and frees
registers according to a primitive priority scheme. For
example, the tuples produced (by POL and TR) for the two
statements:


X = X + 1
IF X > Y THEN X = X + A(I+1) + Z 


are shown in Figure 18.6 together with the instructions
generated by TUPLE. Note that the spurious LOADs and STOREs
which were present in L_ONE are gone. TUPLE assumes that any
temporary variable (of the form TEMPn) is referenced only once
and is not used across statement boundaries.


+--------------------------+--------------------+
| FADD,X,1,TEMP1           |   LOAD  1,X        |
|                          |   FADD  1,=1       |
|                          |                    |
| ASGN,TEMP1,,X            |   STORE 1,X        |
|                          |                    |
| BRLE,X,Y,LBL.1           |   SUB   1,Y        |
|                          |   BRLE  1,LBL.1    |
|                          |                    |
| ADD,I,1,TEMP2            |   LOAD  1,I        |
|                          |   ADD   1,=1       |
|                          |                    |
| FADD,X,A(TEMP2),TEMP3    |   LOAD  2,X        |
|                          |   FADD  2,A(1)     |
|                          |                    |
| FADD,TEMP3,Z,TEMP4       |   FADD  2,Z        |
|                          |                    |
| ASGN,TEMP4,,X            |   STORE 2,X        |
|                          |                    |
| LBL,LBL.1                | LBL.1              |
+--------------------------+--------------------+


Figure 18.6 


The tuples produced by TR (on the left) and the
corresponding code generated by TUPLE (on the
right) for the statement sequence: X = X + 1 ;
IF X > Y THEN X = X + A(I+1) + Z.


The register allocation schemes used in actual compilers seem
to be 'always messy'. TUPLE was written in a highly structured
top-down fashion to avoid this. Note that the higher-level
routines have no notion at all of what the data structure that
associates registers with locations looks like. Only the
low-level caretaker routines know this. This is an example of
'information hiding' as advocated by Parnas [1972].


         DEFINE('TUPLE(OP,ARG1,ARG2,ARG3)R')         :(TUPLE_END)
TUPLE                                                :($('TU_' OP))

TU_ADD   ;TU_FADD  ;TU_SUB   ;TU_FSUB
TU_MUL   ;TU_FMUL  ;TU_DIV   ;TU_FDIV
         R  =  LOAD(ARG1)
         OUTPUT  =  ' ' OP ' ' R ',' ADDR(ARG2)
         DEASSOC(R)
         STORE(R,ARG3)                               :(RETURN)

TU_ASGN  R  =  LOAD(ARG1)
         STORE(R,ARG3)                               :(RETURN)

TU_MNS   R  =  REG()
         OUTPUT  =  ' LOADN ' R ',' ADDR(ARG1)
         STORE(R,ARG3)                               :(RETURN)

TU_BR    ARG3  =  INDEX(ARG3)
         OUTPUT  =  ' BR ' ARG3                      :(RETURN)

TU_BRGT  ;TU_BRGE  ;TU_BRLT  ;TU_BRLE  ;TU_BREQ  ;TU_BRNE
         R  =  LOAD(ARG1)
         OUTPUT  =  ' SUB ' R ',' ADDR(ARG2)
         FREE(R)
         OUTPUT  =  ' ' OP ' ' R ',' ARG3            :(RETURN)

TU_LBL   OUTPUT  =  ARG1
         REG_LIST  =  ','                            :(RETURN)
TUPLE_END


+------------------------------------------------------------+
| LOAD(LOC) will load the indicated location (if not         |
| already loaded) into a register and return the register.   |
+------------------------------------------------------------+

         DEFINE('LOAD(LOC)')                         :(LOAD_END)
LOAD     LOAD  =  ISREG(LOC)                         :S(RETURN)
         LOC  =  ADDR(LOC)
         LOAD  =  REG()
         ASSOC(LOC,LOAD)
         OUTPUT  =  ' LOAD ' LOAD ',' LOC            :(RETURN)
LOAD_END


+------------------------------------------------------------+
| STORE(REG,LOC) is a generalized store operation storing    |
| a given register REG into a given location LOC, updating   |
| the register assignment list.                              |
+------------------------------------------------------------+

         DEFINE('STORE(REG,LOC)')                    :(STORE_END)
STORE    LOC  =  INDEX(LOC)
         FREE(REG)
         ASSOC(LOC,REG)
         LOC TEMP_LOC                                :S(RETURN)
         OUTPUT  =  ' STORE ' REG ',' LOC            :(RETURN)
STORE_END


+------------------------------------------------------------+
| ADDR(LOC) will return a usable address designating the     |
| possibly subscripted location LOC. The address returned    |
| will be a register number if LOC is contained in a         |
| register. If LOC is subscripted, a register number         |
| replaces the subscript. If LOC is a constant, the symbol   |
| '=' is prepended.                                          |
+------------------------------------------------------------+

         DEFINE('ADDR(LOC)')                         :(ADDR_END)
ADDR     ADDR  =  LOC
         ADDR  =  INDEX(ADDR)
         ADDR  =  ISREG(ADDR)                        :S(RETURN)
         ADDR POS(0) SPAN('0123456789') RPOS(0)  =
+           '=' ADDR                                 :(RETURN)
ADDR_END

+------------------------------------------------------------+
| INDEX(LOC) will load the subscript (if any) of the given   |
| location into a register and return the same expression    |
| with the index replaced by a constant.                     |
+------------------------------------------------------------+

         DEFINE('INDEX(LOC)S')                       :(INDEX_END)
INDEX    INDEX  =  LOC
         INDEX '(' BREAK(')') . S  =  '(' LOAD(S)    :(RETURN)
INDEX_END


+------------------------------------------------------------+
| The following five functions are low-level basic routines  |
| used to associate registers with locations. A string of    |
| register-location pairs is kept in the order of increasing |
| priority in REG_LIST. If a register is associated with a   |
| location then the value normally found at that location    |
| will be in the register. Also, if the location is a        |
| temporary, the location will not contain that value;       |
| otherwise the location will also contain the value.        |
+------------------------------------------------------------+

         DEFINE('REG()LOC')
         DEFINE('FREE(REG)')
         DEFINE('ISREG(LOC)')
         DEFINE('ASSOC(LOC,REG)')
         DEFINE('DEASSOC(REG)')

         NO_REGS  =  16
         REG_LIST  =  ','
         TEMP_LOC  =  POS(0) 'TEMP' SPAN('0123456789') RPOS(0)

                                                     :(REG_END)


+------------------------------------------------------------+
| REG() will return an available register. If all registers  |
| are associated with locations, it will free up the         |
| register with the lowest priority.                         |
+------------------------------------------------------------+

REG      REG  =  LT(REG,NO_REGS) REG + 1             :F(REG_1)
         REG_LIST '(' REG ')'                        :F(RETURN)S(REG)
REG_1    REG_LIST ',' BREAK('(') . LOC '('
+           BREAK(')') . REG ')'  =
         LOC TEMP_LOC                                :F(RETURN)
         OUTPUT  =  ' STORE ' REG ',' LOC            :(RETURN)
+------------------------------------------------------------+
| FREE(REG) will free a register for other associations.     |
+------------------------------------------------------------+
FREE     REG_LIST ',' BREAK('(') '(' REG ')'  =      :(RETURN)


+------------------------------------------------------------+
| ISREG(LOC) is a predicate which will determine if LOC is   |
| currently associated with a register. If so it will boost  |
| its priority.                                              |
+------------------------------------------------------------+
ISREG    REG_LIST ',' LOC '(' BREAK(')') . ISREG ')'  =
+                                                    :F(FRETURN)
         REG_LIST  =  REG_LIST LOC '(' ISREG '),'     :(RETURN)

+------------------------------------------------------------+
| ASSOC(LOC,REG) will associate an unsubscripted location    |
| with a register.                                           |
+------------------------------------------------------------+
ASSOC    LOC '('                                     :S(RETURN)
         REG_LIST  =  REG_LIST LOC '(' REG '),'       :(RETURN)


+------------------------------------------------------------+
| DEASSOC(REG) will remove any association a register has    |
| with a location but will not free the register.            |
+------------------------------------------------------------+
DEASSOC  REG_LIST ',' BREAK('(') '(' REG ')'  =
+           ',(' REG ')'                             :(RETURN)
REG_END


Epilogue 


Note that a distinction is made between a register which is
free and one which is merely disassociated. This distinction
is necessary because when a register is about to be stored it
is not yet free (for use as an index register, for example) and
yet it may be unrelated to any given variable. Note also that
although a register could theoretically be associated with two
different locations (such as after A = B), TUPLE allows only
one such association.


No distinction is made between fixed and floating point 
operands of the relational operators. We are here assuming 
that floating numbers operate on the same equality scale as 
integers (a common case). 
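
The division of labour in TUPLE, with caretaker routines that alone
know the shape of the register list, can be illustrated by a compact
port. What follows is a sketch in Python, not the SNOBOL4 program
itself: the names Tupler, emit and compile_tuples are hypothetical, the
association list is held as a Python list rather than as the string
REG_LIST, and the MNS and BR tuples are omitted for brevity.

```python
import re

TEMP = re.compile(r"TEMP\d+$")          # temporaries look like TEMP1, TEMP2, ...
ARITH = {"ADD", "FADD", "SUB", "FSUB", "MUL", "FMUL", "DIV", "FDIV"}
BRANCH = {"BRGT", "BRGE", "BRLT", "BRLE", "BREQ", "BRNE"}


class Tupler:
    def __init__(self, no_regs=16):
        self.no_regs = no_regs
        self.pairs = []                  # [location, register], lowest priority first
        self.code = []                   # generated instructions

    # Caretakers: the only methods that look inside self.pairs.
    def reg(self):
        used = {r for _, r in self.pairs}
        for r in range(1, self.no_regs + 1):
            if r not in used:
                return r
        loc, r = self.pairs.pop(0)       # all busy: evict the lowest priority
        if TEMP.match(loc):              # a temporary lives only in its register
            self.code.append(f"STORE {r},{loc}")
        return r

    def free(self, r):
        self.pairs = [p for p in self.pairs if p[1] != r]

    def isreg(self, loc):
        for p in self.pairs:
            if loc and p[0] == loc:
                self.pairs.remove(p)
                self.pairs.append(p)     # boost its priority
                return p[1]
        return None

    def assoc(self, loc, r):
        if "(" not in loc:               # subscripted locations are not tracked
            self.pairs.append([loc, r])

    def deassoc(self, r):
        for p in self.pairs:
            if p[1] == r:
                p[0] = ""                # register stays busy but names nothing

    # Higher-level routines: written only in terms of the caretakers.
    def index(self, loc):
        m = re.search(r"\(([^)]*)\)", loc)
        if m is None:
            return loc
        return loc[:m.start(1)] + str(self.load(m.group(1))) + loc[m.end(1):]

    def addr(self, loc):
        loc = self.index(loc)
        r = self.isreg(loc)
        if r is not None:
            return str(r)
        return "=" + loc if loc.isdigit() else loc

    def load(self, loc):
        r = self.isreg(loc)
        if r is not None:
            return r
        loc = self.addr(loc)
        r = self.reg()
        self.assoc(loc, r)
        self.code.append(f"LOAD {r},{loc}")
        return r

    def store(self, r, loc):
        loc = self.index(loc)
        self.free(r)
        self.assoc(loc, r)
        if not TEMP.match(loc):          # temporaries are kept in registers only
            self.code.append(f"STORE {r},{loc}")

    def emit(self, op, a1="", a2="", a3=""):
        if op in ARITH:
            r = self.load(a1)
            self.code.append(f"{op} {r},{self.addr(a2)}")
            self.deassoc(r)
            self.store(r, a3)
        elif op == "ASGN":
            self.store(self.load(a1), a3)
        elif op in BRANCH:
            r = self.load(a1)
            self.code.append(f"SUB {r},{self.addr(a2)}")
            self.free(r)
            self.code.append(f"{op} {r},{a3}")
        elif op == "LBL":
            self.code.append(a1)
            self.pairs = []              # forget every association at a label


def compile_tuples(text):
    """Translate whitespace-separated tuples such as 'ASGN,TEMP1,,X'."""
    t = Tupler()
    for line in text.split():
        t.emit(*line.split(","))
    return t.code
```

Calling compile_tuples on the eight tuples in the left column of
Figure 18.6 gives a list of instructions that can be compared with the
right column. Replacing the list in Tupler by some other structure
would require changes to the five caretaker methods only.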


18.8   A macro system is basically a method whereby the user
of the system may define and employ abbreviations. GPM stands
for General Purpose Macro processor and was developed by
Strachey [1965]. GPM is general purpose in two ways: it can
be employed as a preprocessor for an arbitrary language and it
can perform arbitrary string computations.


Macros first grew into prominence with the development of as- 
semblers. Initially they were mere abbreviations for instruc- 
tion sequences but soon grew more sophisticated with the 
introduction of arguments, conditional assembly instructions, 
repeat and sequencing facilities. Macros were able to define 
other macros and redefine themselves. McIlroy [1960] describes 
many of these techniques. 


It was soon realized that a complete computational facility 
could be implemented relatively easily based on little more 
than the ability to define a macro and GPM was one of the 
first complete languages to be based on a macro system. But 
whereas GPM is complete, as we shall see later, one must al- 
most stand on one's computational head to perform certain com- 
mon operations (e.g., see Exers. 18.25 and 18.27). 


We will write GPM as a function GPM(S) which will return a 
translation of string S. If S does not contain either of the 
two special characters '#' or '<', it will be returned intact. 
A sequence of the form: 


     #name,arg1,arg2, ... ,argn;

is considered to be a macro call. Macro calls within the 
string S will be replaced by an evaluation. Every macro call 
returns a string (which is possibly null). This returned 
string is again passed through GPM by a recursive call to ob- 
tain the macro's evaluation. 

The built-in macro DEF allows macros to be defined. 

     #DEF,name,pr;

will define a macro by the given name and associate it with a
prototype pr. It returns the null string. For example,

     #DEF,M,STRING;

will define a macro M whose prototype is 'STRING'. When M is
called as in:

     #M;

the value returned is 'STRING'. Hence:

     GPM('#DEF,M,STRING;x#M;y')

will return 'xSTRINGy'.

In some respects, the DEF function may be thought of as
assigning a string to a name. But a macro may also have
arguments which may be embedded within the prototype. The
position of the first, second, third, etc. argument is
indicated by the position of the symbols &1, &2, &3, etc. Thus:

     #DEF,SQUARE,&1*&1;

defines the macro SQUARE with one argument. The macro call:

     #SQUARE,(X+Y);

returns '(X+Y)*(X+Y)'. Within the argument list of a macro
call there may be other macro calls and these are evaluated to
obtain the actual arguments. For example,

     #SQUARE,#M;Y;

returns 'STRINGY*STRINGY'. The macro call may be suppressed
by surrounding a string with pointed brackets. Thus
GPM('AA<#>AA') returns 'AA#AA'. Pointed brackets are stripped
off in pairs. Thus, GPM('A<B<C>D>E') returns 'AB<C>DE'.
Pointed brackets may be used to defer evaluation of macro calls
until some later time. Thus

     #DEF,A,<#M;>;

will associate with A the prototype '#M;'. When A is called
as in #A; the returned string is evaluated leading to a call
on #M; which returns 'STRING'.


Were the returned values merely substituted for the macro call
without again being evaluated, the macro system we have
described so far would only be useful as a system for forming
abbreviations. But by the simple act of reevaluating the
returned value, we obtain a general purpose computational
language, a language capable of expressing anything
computable. This is a remarkable fact. To see that this is so,
consider defining a conditional macro #COND,X,Y,Z; which
evaluates to Z if X equals Y and evaluates to null otherwise.
On the one hand, if the returned string were not reevaluated
it would be impossible to write COND (should it be written as
the null string or as &3?) and hence GPM would not be
completely general. On the other hand, a conditional allows one
to simulate a Turing machine and hence perform arbitrary
computations. To see this, reflect that a state-transition table
(as in a Turing machine) may be implemented as a collection of
conditionals (one for every combination of states and inputs).

We may write #COND,X,Y,Z; as:

     #DEF,COND,<#DEF,<&1>,;#DEF,<&2>,<&3>;#<&1>;>;

In the above, the first argument is defined as a macro which
evaluates to null. The second argument is also defined as a
macro and this definition overrides the first if and only if
the first two arguments are equal (a macro name need not be an
identifier but may be any string of symbols). Finally, the
macro named by the first argument is called. The returned
value is the third argument if the second definition overrode
the first. Programming in this language is opaque but is
perfectly general.

If the argument to GPM is not well-formed, meaning that a '#'
is not followed by a corresponding ';' or that a '<' is not
followed by a corresponding '>', GPM will fail. This fact can
be used to apply GPM to a program without reading it into main
storage in its entirety. Only a sufficient amount of it need be
read to enable GPM to succeed. Said another way, if GPM(S1)
succeeds then GPM(S1) GPM(S2) equals GPM(S1 S2).


There is one point in which the implementation given departs
from official GPM as defined by Strachey. Macro definitions
here are global and not local to the evaluation of a specific
macro. Assume the following definition occurs.

     #DEF,X,Initialization <#DEF,X,Action;#X;>;

In our system, #X; will evaluate to 'Initialization Action'
on the first call and to 'Action' on all subsequent calls.
This is because the macro X redefines itself. In Strachey's
system the macro definitions are pushed so that when return is
made to the outer level the original definitions remain
intact. Hence a macro could not redefine itself. There are
advantages and disadvantages to both. As a computation tool,
Strachey's system is perhaps superior since macro names can
serve as temporary variables. For a practical macro processor,
however, it is better to have global macro names.
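
The behaviour described in this section can be tried out with a small
recursive evaluator. The following is a sketch in Python, not a
transcription of the SNOBOL4 program; it assumes the rules stated in
the text (arguments are evaluated before the call, the result of a
call is evaluated again, definitions are global), and the names gpm,
GPMError and the helper functions are hypothetical.

```python
import re


class GPMError(ValueError):
    """Raised where GPM would fail (ill-formed input, undefined macro)."""


def _literal(s, i):
    # s[i] is '<': return the text inside the matching brackets (one
    # layer stripped) and the index just past the closing '>'.
    depth = 0
    for j in range(i, len(s)):
        if s[j] == "<":
            depth += 1
        elif s[j] == ">":
            depth -= 1
            if depth == 0:
                return s[i + 1:j], j + 1
    raise GPMError("'<' without a matching '>'")


def _substitute(prototype, args):
    # Replace &1, &2, ... by the corresponding argument; a reference to
    # an argument that was not supplied is left untouched.
    def pick(m):
        k = int(m.group(1))
        return args[k - 1] if 1 <= k <= len(args) else m.group(0)
    return re.sub(r"&(\d+)", pick, prototype)


def _sequence(s, i, table, stops):
    # Evaluate s from index i up to (not including) a character in stops.
    out = []
    while i < len(s) and s[i] not in stops:
        if s[i] == "<":
            text, i = _literal(s, i)
        elif s[i] == "#":
            text, i = _call(s, i + 1, table)
        else:
            text, i = s[i], i + 1
        out.append(text)
    return "".join(out), i


def _call(s, i, table):
    # Parse 'name,arg1,...,argn;' starting just after '#', evaluating
    # the name and each argument as they are read.
    parts = []
    while True:
        text, i = _sequence(s, i, table, ",;")
        parts.append(text)
        if i == len(s):
            raise GPMError("'#' without a matching ';'")
        i += 1
        if s[i - 1] == ";":
            break
    name, args = parts[0], parts[1:]
    if name == "DEF":
        if len(args) < 2:
            raise GPMError("DEF needs a name and a prototype")
        table[args[0]] = args[1]         # definitions are global
        return "", i
    if name not in table:
        raise GPMError("undefined macro " + repr(name))
    body = _substitute(table[name], args)
    return gpm(body, table), i           # the returned string is reevaluated


def gpm(s, table):
    """Return the translation of s; table maps macro names to prototypes."""
    text, _ = _sequence(s, 0, table, "")
    return text
```

With one shared table, gpm('#DEF,M,STRING;x#M;y', table) returns
'xSTRINGy', and the COND and self-redefining X macros above can be
defined and called in the same way. Unlike the SNOBOL4 version, this
sketch treats an unmatched '>' as an ordinary character.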


         DEFINE('GPM(S)PREFIX,BOD,ARG,NAME,N,PUSH_POP')

+------------------------------------------------------------+
| Initialization section for GPM: FORB_CH (forbidden         |
| character) is assigned a character not permitted in the    |
| source string. GPM_BAL is assigned a pattern which will    |
| match a string balanced in the GPM sense. Note that        |
| although <> and #; both serve as a kind of parenthesis     |
| they are not symmetric.                                    |
+------------------------------------------------------------+

         &ALPHABET LEN(1) . FORB_CH

         MAC_TBL  =  TABLE()

         ITEM  =  '<' BAL('<>') '>'  |  '#' *GPM_BAL ';'
+           |  NOTANY('<#') BREAK('<#>;,')

         GPM_BAL  =  ARBNO(ITEM)

+------------------------------------------------------------+
| This is the basic pattern used to process strings. PREFIX  |
| is the string up to a macro call or a <...> literal. BOD   |
| will be either the literal body or the result of           |
| evaluating the macro.                                      |
+------------------------------------------------------------+

         GET.PREFIX.BOD  =  POS(0) BREAK('<#') . PREFIX FENCE
+           ('<' BAL('<>') . BOD '>'  |
+            '#' GPM_BAL . NAME . *PROC('NAME')
+            ARBNO(',' GPM_BAL . ARG . *PROC('ARG'))
+            ';' . *PROC('MEND'))  |
+           REM . PREFIX NULL . BOD                  :(GPM_END)

+------------------------------------------------------------+
| Entry point:                                               |
+------------------------------------------------------------+

GPM      IDENT(S)                                    :S(RETURN)
         S GET.PREFIX.BOD  =                         :F(FRETURN)
         GPM  =  GPM PREFIX BOD                      :(GPM)

GPM_END


+------------------------------------------------------------+
| The routine PROC will process macro names (at PNAME),      |
| macro arguments (at PARG), and macro terminations (at      |
| PMEND).                                                    |
+------------------------------------------------------------+

         DEFINE('PROC(TYPE)')                        :(PROC_END)
PROC     PROC  =  .DUMMY                             :($('P' TYPE))
PNAME    NAME  =  GPM(NAME)
         N  =  0
         PUSH_POP  =
         PUSH(NAME)                                  :(NRETURN)
PARG     PUSH(GPM(ARG))
         N  =  N + 1                                 :(NRETURN)
PMEND    BOD  =  IDENT(NAME,'DEF') POP()             :F(PMEND_2)
         MAC_TBL<POP()>  =  BOD
         BOD  =                                      :(NRETURN)
PMEND_2  BOD  =  REPLACE(MAC_TBL<NAME>,'&',FORB_CH)
PMEND_1  BOD FORB_CH N  =  TOP()                     :S(PMEND_1)
         N  =  N - 1
         POP()                                       :S(PMEND_1)
         BOD  =  GPM(BOD)                            :(NRETURN)
PROC_END
Names referenced     Name     Type        Where defined
by GPM:              BAL *    Function    Program 8.3
                     PUSH     Function    Program 5.5
                     POP      Function    Program 5.6

* indicates name is referenced in the initialization section.


Exercises

Exercise 18.1   Suggest a method (or methods) whereby the
OPS and SYMS tables of ASM (Prog. 18.1) can be made smaller at
the expense of time. Implement one of your plans.

Exercise 18.2   Add expressions to ASM (binary +, -, * and
/ and unary -) by modifying the semantic routines of L_ONE for
the purpose. Let the period (.) mean the current address.

Exercise 18.3   Assuming there are eight bits per character,
how would you modify ASM to output (on the PUNCH file) a
32-bit word as four characters?

Exercise 18.4   Modify ASM to allow symbols of the form
=<constant>. For example, =37 implies the address of the
constant 37. (This convention was actually assumed by TUPLE,
Prog. 18.7.) Be sure to avoid generating duplicate constants.
All such literals should be placed after the last instruction
of the program being assembled.

Exercise 18.5   What character is not permitted in the
argument to S(name), the semantic subfunction of L_ONE,
Prog. 18.2? How can S(name) be modified to avoid this
restriction?

Exercise 18.6   Augment Language L1 (Figure 18.2) by allowing
subscripted expressions. Modify L_ONE accordingly.


Exercise 18.7   Identifiers seen by L_ONE are passed on to
the assembler untouched. This is not always desirable. Modify
L_ONE so that each identifier is replaced by a unique
'internal' name.

Exercise 18.8   Extend L_ONE to handle real arithmetic. An
identifier is assumed to be integer or real (floating point)
depending on whether or not it begins with one of the letters
'IJKLMN'. Allow mixed expressions both in binary operations
and across an assignment. Assume two additional instructions
for machine M, viz. CIR which converts from integer to real
(loading into the target register) and CRI which converts
from real to integer.

Exercise 18.9   Write a program which will read in a BNF
grammar and produce for each syntactic variable <v> a pattern
named V that will match it. Assume there are no extraneous
blanks. (This requires about eight instructions.)

Exercise 18.10   It has been observed that well over half of
all Fortran programs appearing on listings dumped into a
certain trash can contain no interior blanks. Use this
observation to improve the speed of BLANKS.

Exercise 18.11   If BLINT (a pattern in BLANKS, Prog. 18.3)
is simplified to SPAN(NUM ' ') then BLANKS will operate
incorrectly in some cases. Furnish such a case.


Exercise 18.12   A squeamish programmer, wishing to avoid
left-recursion, writes, for the definition of E (a pattern in
POL, Prog. 18.4):

     E  =  TERM ADDOP *E POL(2)  |  TERM

What error has been introduced? Give an example of a statement
which would yield incorrect results.

Exercise 18.13   Modify POL so that a null statement is
allowed. This would permit, for example, the sequence:

     IF A=1 THEN ELSE X = 2

Exercise 18.14   Modify POL, Prog. 18.4, to allow IF ...
THEN ... ELSE type expressions. An example is:

     A = IF A > 0 THEN 1 ELSE -1

Transform this syntax into Polish using a 3-ary operator
called EIF (Expression IF).

Exercise 18.15   This exercise indicates how error messages
may be incorporated into POL(). Write a function DNF(S1,S2)
(Did Not Follow) which will form the message:

     A valid ... S1 ... was encountered but
     this was not followed by a valid ... S2 ...

This is to be appended onto a global error message string
(MESSAGE) which is printed if the statement cannot be matched.
Using DNF, modify the patterns of POL, Prog. 18.4, to issue
error messages in the following cases: (1) an expression
doesn't follow an '=' in assignment, (2) a Boolean doesn't
follow an IF, (3) a statement doesn't follow a 'THEN', (4) a
primary doesn't follow a unary minus, (5) an expression
doesn't follow a '('.


Exercise 18.16   This exercise indicates how SNOBOL4 pattern
matching can be used on the intermediate form to achieve a
degree of machine-independent code optimization. Scan a Polish
string (as output by POL, but with a trailing comma) for a
pattern which resulted from an assignment of the form

     <VAR> = <VAR> + <E>

where <VAR> is the same (possibly subscripted) variable.
Transform this into the 2-ary form:

     AUG:2,<VAR>,<E>

Do the same for an assignment in which the <E> is the first
operand.

Exercise 18.17   Write a pattern to match an arbitrary tree
with no upper limit on the number of leaves.

Exercise 18.18   Modify TREE to accept N additional arguments
to be associated with the various leaves of the tree. Thus

     TREE('+', 2, .NAME1, .NAME2)

will return, in effect,

     '+:2,' ARB_TREE . NAME1 ARB_TREE . NAME2

To do this exercise, you must assume some maximum N (already
assumed anyway in the coding of TREE). For extra credit, make
your program entirely dependent on the parameter MAX_N.

Exercise 18.19   In POL, Prog. 18.4, argument lists were
compiled into a Polish notation having the form:


Use pattern matching to convert this into the form:

     COMMA:n,arg1,arg2, ... ,argn

Exercise 18.20   Modify TR, Prog. 18.6, to handle mixed
expressions, both in the binary arithmetic operations and
relations and across assignments. Assume tuples

     CVTIR,Arg1,,Arg3
     CVTRI,Arg1,,Arg3

exist to convert from integer to real and real to integer
respectively.


Exercise 18.21   The following exercise extends TR (Prog.
18.6) to include functions. Assume that the tuples required
for output for the function reference:

     FUNC(Arg1,Arg2, ... ,Argn)

are

     ARG,Arg1
     ARG,Arg2
     CALL,FUNC,,RES

where RES is the location in which the result is deposited.
Assume that the function ATEST(ID) exists which is a predicate
to determine whether ID is an array. If ID is not an array,
it must be a function.

Exercise 18.22   Modify L_ONE to call TUPLE rather than
producing unoptimized code.

Exercise 18.23   TUPLE (Prog. 18.7) is stupid in not
optimizing the case where the 2nd argument is already in a
register and the first argument is not and the operation is
(F)ADD or (F)MUL. Modify TUPLE to handle this.

Exercise 18.24   The action taken by TUPLE for a label is
rather ruthless (removing all previous register associations).
For labels generated as a result of IF processing, only those
symbols need be disassociated that are actually modified by
one of the clauses. Write a routine that will scan the output
of TR to determine which symbols are modified and arrange to
have only these disassociated when IF-type labels are
encountered.

Exercise 18.25   The following formula from Strachey [1965]
defines a macro S with one argument.

     #DEF,S,<#1,2,3,4,5,6,7,8,9,10,#DEF,1,<&>&1;;>;

What is the result of (a) #S,2; (b) #S,5; (c) In words, what
does S do?

Exercise 18.26   Modify ASM so that it uses GPM as a macro
processor. Allow macro prototypes to contain more than one
line. This can be done by encoding line boundaries as a
special character sequence.

Exercise 18.27   It is sometimes required to build up a
large string at assembly time. Write a macro #CS,S;
(Concatenate String) such that when #S; is called all the
strings so far passed to CS will be returned concatenated
together.


SOLUTIONS FOR ODD-NUMBERED EXERCISES

=============== Solutions for Chapter 2 ===============
2.1 The body of the function UP(ARG) is 
UP UP = REPLACE (ARG,LOWERS_,UPPERS_) : (RETURN) 
2.3 
L P (POS(0) (SPAN(' ') | "yp yt. "yp. 

+ ANY (UPPERS_) . C = T  UPLO(C) 2S (L) 

P = UPLO(P) 
2.5

SIZE(BASEB(K,2))

SIZE(BASEB(K,n))
2.7

DEFINE ('V (ARG) B,S,E,F') : (V_END) 
Vv B = BASEB(BASE10 (ARG, 16) ,2) 

B LEN(1) . S LEN(10) . E REM. F 

Vv = (-1) ** S CONVERT(BASE10(F,2),'REAL') * 
+ 2 ** (BASE10(E, 2) - 1045) 

: (RETURN) 

V_END : 


2.9 Those involving built-in numerical operators: EQ, REMDR,
/, * and + (four statements in all).


2.11 Initialize H with '01234567'; then replace all 16's by
8's and replace all HEX's by OCT's.
week) is equal to the DAY of the first, second or third of the
following month, the day is invalid.


2.15 M = CEIL((5 * D - 150) / 153.) (See the chapter on
arithmetic for an analysis of this); then take the number of
days and subtract off 31+28 (or 31+29 in a leap year); if this
number is negative, add the number of days in the year (365 or
366). Use the formula above to determine M. Then REMDR(M +
2, 12) + 1 is the month.


2.17 Insert a test and branch at the entry point of SPELL and
insert a section of code labeled SPELL_LONG as follows:


SPELL    LE(SIZE(N),6)                               :F(SPELL_LONG)
SPELL_LONG N RTAB(6) . M  =
         SPELL  =  SPELL(M)

         SPELL 'SEPT'  =  'OCT'
         SPELL 'SEXT'  =  'SEPT'
         SPELL 'QUINT'  =  'SEXT'
         SPELL 'QUADR'  =  'QUINT'
         SPELL 'TR'  =  'QUADR'
         SPELL 'B'  =  'TR'
         SPELL 'M'  =  'B'

         SPELL  =  SPELL ' MILLION'
         SPELL  =  NE(N,0) SPELL ' ' SPELL(N)        :(RETURN)


2.19

: ‘#C4D#EF #GGHA#B' ‘TAB(N) NOTANY(*#") . NOTE | 
+ TAB(N - 1) LEN(2) . NOTE 
=============== Solutions for Chapter 3 ===============


3.1 RPAD(S,N,C) = REVERSE(LPAD(REVERSE(S),N,C))


3.3 CENTER(S,N,C) = RPAD(LPAD(S,(N + SIZE(S)) / 2,C),N,C)
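
The two identities in Solutions 3.1 and 3.3 can be checked with a
short sketch. It is written in Python purely for illustration and
assumes that LPAD(S,N,C) pads S on the left with the single character
C up to length N, returning S unchanged when it is already that long;
the lowercase function names are hypothetical.

```python
def lpad(s, n, c=" "):
    # Pad s on the left with c to length n (s is returned as is if longer).
    return s.rjust(n, c)


def rpad(s, n, c=" "):
    # Solution 3.1: right padding is left padding of the reversed string.
    return lpad(s[::-1], n, c)[::-1]


def center(s, n, c=" "):
    # Solution 3.3: pad on the left to the midpoint width, then on the
    # right to the full width n.
    return rpad(lpad(s, (n + len(s)) // 2, c), n, c)
```

For example, center('AB', 6, '*') first pads 'AB' on the left to width
4 and then on the right to width 6. When N - SIZE(S) is odd the extra
pad character goes on the right.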


3.5 (a) REPLACE('CXCB','BBCD!,S) ; (b) 4 
3.7 (a)

         DEFINE('TPOS(S,H,W)K,C')                    :(TPOS_END)
TPOS     S POS(K) LEN(1) . C                         :F(TPOS_1)
         TPOS  =  TPOS C
         K  =  K + W                                 :(TPOS)
TPOS_1   GE(SIZE(TPOS),H * W)                        :S(RETURN)
         K  =  REMDR(K,W) + 1                        :(TPOS)
TPOS_END

(b) 

| SALPHABET LEN(H * W) . S1 


S2 = TPOS(S1) 
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DEFINE (' ENCODE (S) T') 
SALPHABET LEN(H * W) . S1 


PS1 = TPOS(S1,H,W) : (ENCODE_END) 
ENCODE S LEN(H * W).T = :F (ENCODE_1) 

ENCODE = ENCODE REPLACE(PS1, S1, T) |: (ENCODE) 
ENCODE_1 

S = S DUPL(':', H * W- SIZE(S)) 

ENCODE = ENCODE REPLACE (PS1,S1,S) 

ENCODE = DIFF(ENCODE| ':') : (RETURN) 
ENCODE_END 


3.9 Do a positional transformation to obtain the odd charac- 
ters in the string (H1). Then do a similar transformation to 
obtain the even characters (H2). Transliterate H1 so that 
digit k goes to the (16 * k)th character of S&ALPHABET. Trans- 
literate H2 so that digit K goes to the Kth character. Then 
OR the resulting strings. - 


3.11 '00112233445566778899' 

3.13 IDENT(SKIM(S) ,S) 

3.15 (a)

REVERSE(REPLACE(TRIM(REPLACE(REVERSE(S),'0',' ')),' ','0'))
(b) +S
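
Read from the inside out, the expression in part (a) reverses the
string, turns zeros into blanks, trims the trailing blanks (which were
the leading zeros), restores the remaining blanks to zeros and
reverses again. A step-by-step sketch in Python, for illustration
only, follows; it assumes S consists of digits alone, since any blank
already in S would come back as a zero, and the function name is
hypothetical.

```python
def strip_leading_zeros(s):
    r = s[::-1]                # REVERSE(S): leading zeros are now trailing
    r = r.replace("0", " ")    # REPLACE(...,'0',' '): every zero becomes a blank
    r = r.rstrip(" ")          # TRIM: drop the trailing blanks
    r = r.replace(" ", "0")    # REPLACE(...,' ','0'): restore interior zeros
    return r[::-1]             # REVERSE: back to the original order
```

Part (b) obtains the same digits by numeric conversion, which in
Python corresponds to int(s) for a nonempty digit string.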

3.17 SWAP, SWAP_ARG1 and SWAP_ARG2 

3.19 a-ht, b-ht, d-h 


3.21 (XY) X.Y Y¥.X 


=============== Solutions for Chapter 4 ===============


4.1 M = CRACK ('JAN.,FEB.,MARCH,APRIL,..0', ',') 


4.3 (a) opposite pairs are swapped twice resulting in a 


mutual cancellation. A remains unchanged, I is set to N + 1. 
(b) SEQ(*' J =N+1- 1 3 (GT(J,I) SWAP(.A<I>,.A<J>))',.I) 


4.5 SEQ(" A<I> POS(0) NOTANY('M') “", .T) 
4.7 It is equivalent to AOPA(A1,' ',A2)

4.9 STRINGOUT( AOPA(CRACK(X),' ',CRACK(Y)) )
4.11 A<FIND(A,'-LGT')>


4.13 A practical version of the following function would use 
'funny' names for temporaries and parameters. 
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DEFINE("DO(S,NeL,U,I) *) 


| : (DO_END) 
BO S = CODE(S * 3; s: (DO_1)*) : F (FRETURN) 
$N = L 2<s> 
DO_1- $y = §N + I . . 
LE ($N, U) 3 S<S>F (RETURN) 
DO_END: | | 
“4,15 . 
DEFINE (* PUSH (A, E) *) : (PUSH_END) 
- PUSH PUSH = A , 
A<1> = A<1> # 1 
PUSH_1 A<A<1>> = E. :S (RETURN) 
‘A = CATA(A,A) | 
PUSH_END co a , 
=============== Solutions for Chapter 5 ===============
5.1
_ DEFINE ("CRACK (S,B) N,V, PAT*) : (CRACK_END) 
CRACK IDENT (B, NULL) ; :S (CRACK_1) 
Ss RTAB(1) B ABORT | REM. S = S8B 
PAT =  BREAK(B) . V_ LEN(1) 
‘CRACK_2 S PAT = 3: F (RETURN) 
$N = LINK(,V) 
'  N = NEXT (SN). 2 (CRACK_ 2) 
CRACK_1 PAT = LEN(1) .V : (CRACK_ 2) 
CRACK_END 
S23 (a) 
IDENT (PUSH_ POP). +S (FRETURN) 


NM = .PUSH_POP 
DIFFER (NEXT ($NM))  .NEXT(S$NM) :S(FIRST_1) 
VALUE ($NM) 


"yj 
Ln! 
we 
” 
3 
Hou at 


$NM : (RETURN) 
(b) Use a doubly-linked list as in Ex. 5.2. 
Se No modification to REVL is required. 
.7 : 
DEFINE (*IFFLD(N,S) I,F*") : (IFFLD_END) 
IFFLD F = FIELD(DATATYPE(S),I +# 1) : F (FRETURN) 
. I. DIFFER(F,N) I + 1 . 3S (IFFLD) F (RETURN) 
IFFLD_END 


5.9 (1) Insert the four characters ',NEW' behind *MARKt in 

“the DATA function. (2) Use the constant 2 rather than 1 in 

FIEID. (3) The third statement after VISIT_1 should read: 
FLD (SON, I) = GT(..-) NEW(GS) :S(VISIT_1) 
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(4) Change VISIT_2 to: 


‘VISIT_2 NEW(SON) = COPY (SON) s; SCN = NEW(SON) 

(5) Return the copied configuration by modifying VISIT_3 to: 
VISIT_3 VISIT = IDENT(FATHER) SON : S(RETURN) 

=============== Solutions for Chapter 6 ===============


6.1 a-F, b-T, c-F, d-F, e-T, f-T, g-F, h-T, i-T, j-T.


6.3 The canonical form is 'BED' | 'BEDS' | 'BEAD' | 'BEADS' |
'RED' | 'REDS' | 'READ' | 'READS'. The pattern is not monic.


6.5 a-Y, b-N, c-Y, d-Y, e-Y, f-Y, g-Y, h-Y, i-N, j-N.


eee 


NULL { NULL {| NULL { NULL { NULL { ... 
2-9 (L2#3L+2) /2 
6.11 NNA2 ** LT 
6.13 a-Y, b-N, c-N, d-N, e-Y, f-N, g-Y, h-Y.
6.15 (a) [0, 2] (b) [0, 2, 4 4] (c) 2**K


6.17 ARBNO('AA' | 'A') will match all even-length sequences
of A's before matching odd sequences.


6.19 P1 = FENCE 'ABC', P2 = FENCE 'XYZ'.




6.21 
a) RPOS(0) | BREAK(S) SUCCEED
b) ANY(S)
c) ANY(S) | BREAK(S) ANY(S) SUCCEED
d) POS(N) SUCCEED | TAB(N)
e) P = TAB(N) | RTAB(N) TAB(N) SUCCEED | RTAB(N) X


====================== Solutions ======================
======================    for    ======================
====================== Chapter 7 ======================

7.1
BREAKP   C = CURSOR
BREAKP.1 SUBJECT POS(CURSOR) ANY(ARG(NODE))     :S(S)
         CURSOR = GE(CURSOR,LENGTH) C           :S(F)
         CURSOR = CURSOR + 1                    :(BREAKP.1)

Full credit if LF is used instead of F; half credit if MF is
used. If the pattern match and test are inverted, take 3/4
credit.
7.3 (a) 2 ** N (b) (4 ** N + 2) / 3


7.5 To form a loop of alternates by alternation or a loop of
subsequents by concatenation would require that the loop go
through the root of the second argument since this is the only
kind of arrow added by these operations. But since the second
argument does not impinge on the first, no loop can be formed.
If a loop was formed via ARBNO(P) it must go through P. But
it could not be a loop of alternates since only solid arrows
are added out of P and it could not be a loop of subsequents
because only a dotted arrow enters P.


7.7 a-9, b-20, c-40, d-14, e-1, f-7


7.9 a-Yes, b-Yes, c-No, d-Yes


7.11 Design TAB(N) as a compound consisting of a node TAB1
and an alternate TAB2. TAB1 pushes the futility flag, TAB2
restores it and fails.


7.13
ARBN1    PUSH(FUTILITY)
         FUTILITY = 1                                      :(S)
ARBN2    FUTILITY = EQ(FUTILITY,1) EQ(&FULLSCAN,0) POP()   :S(LF)
         POP()                                             :(S)


7.15 Create a compound similar to Figure 7.8 with NOT1, NOT1B
and NOT2 in place of VA1, VAB1 and VA2 and with no VAB2. NOT1,
like VA1, pushes a nonnegative value onto Stack Alpha. NOT2
changes this to a negative value and fails. NOT1B (NOT1 on
Backup) pops the value and succeeds or fails depending on
whether the value is positive or negative.


7.17 Call the root node r. Then
         D(r) = D(s) | LEN(1) D(r) | D(a)
Since D(r) is supposed to equal ARB D(s) | D(a) we may plug
this trial value into the right hand side and after some
manipulation we obtain
         ARB D(s) | LEN(1) D(a) | D(a)
which does not equal the trial value.


7.19
SCAN     IDENT(ALT(NODE))                       :S($PROG(NODE))
         PUSH(NODE) ; PUSH(CURSOR)
         NODE = ALT(NODE)                       :(SCAN)
S        NODE = SUBS(NODE)
         IDENT(NODE)                            :S(RETURN)F(SCAN)
F        CURSOR = POP() ; NODE = POP()
         IDENT(NODE)                            :S(FRETURN)F($PROG(NODE))
8.1 ARBNO(NOTANY(S)) RPOS(0) | BREAK(S)
8.3 Replace calls to BREAK by calls to BREAKREM.
8.5 3,4,5,6


8.7 When NAME is converted to expression the result is not


8.9 NULL
8.11 IF(P) = NOT(NOT(P))
8.13 In the fourth line following LIKE_1 add a third alternative
to produce:
         LIKE = LIKE | T1 T2 | T1 LEN(1) T2


8.15 either parenthesis 


8.17
         QLIT = Q BREAK(Q) Q
         CMNT = '/*' ARBNO(NOT('*/') LEN(1)) '*/'
         ELEM = QLIT | CMNT | NOT(Q | '/*') LEN(1) BREAK('/;' Q)
         PLI.STMT = POS(0) (ARBNO(ELEM) ';') . STMT
8.19
         DEFINE('NAME(NO)D,X')                  :(NAME_END)
NAME     NO LEN(1) . D =                        :F(RETURN)
         '2ABC3DEF4GHI5JKL6MNO7PRS8TUV9WXY0ZZZ1***' D LEN(3) . X
         NAME = NAME ANY(X)                     :(NAME)
NAME_END
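
For readers without a SNOBOL4 system at hand, the idea behind NAME can be sketched in Python: each digit of the number selects the three letters that follow it in the dial string, and the result is a pattern matching any spelling of the number. The names DIAL and name_pattern are illustrative only, and the dial string is the one read from the listing above, so it should be treated as an assumption wherever the scan was unclear.

```python
import re

# Dial table as read from the NAME listing: each digit is followed by its
# three letters (this dial has no Q; 0 and 1 carry placeholder groups).
DIAL = "2ABC3DEF4GHI5JKL6MNO7PRS8TUV9WXY0ZZZ1***"


def name_pattern(number: str) -> str:
    """Return a regular expression matching every spelling of `number`."""
    classes = []
    for digit in number:
        if not digit.isdigit():
            raise ValueError(f"not a digit: {digit!r}")
        at = DIAL.index(digit)           # position of the digit in the table
        letters = DIAL[at + 1:at + 4]    # the three characters after it
        classes.append("[" + re.escape(letters) + "]")
    return "".join(classes)
```

A word spells the number when re.fullmatch(name_pattern(number), word) succeeds, which plays the role of matching against the concatenated ANY(X) patterns.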
====================== Chapter 9 ======================
9.1
         DEFINE('READ(P)')                      :(READ_END)
READ     LT(NF_INPUT,0)                         :S(FRETURN)
         READ = POP()                           :S(READ_1)
         READ = INPUT                           :F(READ_2)
READ_1   READ  P                                :S(RETURN)
         PUSH(READ)                             :(FRETURN)
READ_2   NF_INPUT = NF_INPUT - 1                :(READ)
READ_END
9.3 The following will remove blanks except within string
literals as defined in the exercise. To handle 'real' Fortran
we must be a bit more sophisticated. See BLANKS, Prog. 18.3.


Before returning, execute the following code. The patterns
can (and perhaps should be) defined out of line.
         Q = "'"  ;  QQ = '"'
         QLIT = Q BREAK(Q) Q | QQ BREAK(QQ) QQ
         HOL = SPAN('0123456789') $ N 'H' LEN(*N)
         PAT = POS(0) ARB . T1 NULL . T2
+           (SPAN(' ') | (QLIT | HOL) . T2)
         FORTREAD LEN(6) . T =
FORTREAD_2 FORTREAD PAT =                       :F(FORTREAD_3)
         T = T T1 T2                            :(FORTREAD_2)
FORTREAD_3 FORTREAD = T FORTREAD                :(RETURN)


The above will not handle the rare case that the integer
preceding the H in a Hollerith literal contains interspersed
blanks. This can be handled as follows (take extra credit if
you did this):

         HOL = SPAN('0123456789 ') $ N 'H' LEN(*DIFF(N,' '))


9.5 The following rendition of ASMREAD assumes that the READ
routine removes comments.

         DEFINE('ASMREAD()A,T')
         CONTINUE = TAB(71) . T NOTANY(' ')
         CONTINUE16 = DUPL(' ',16) CONTINUE
         ORDINARY = TAB(71) . T
         ORDINARY16 = DUPL(' ',16) ORDINARY     :(ASMREAD_END)
ASMREAD  A = READ(CONTINUE) T                   :S(ASM_1)
         ASMREAD = READ(ORDINARY) T             :S(RETURN)F(FRETURN)
ASM_1    A = READ(CONTINUE16) A T               :S(ASM_1)
         A = READ(ORDINARY16) A T               :F(RETURN)
         ASMREAD = A                            :(RETURN)
ASMREAD_END
9.7 (a)  S POS(C - 1) LEN(L) . A = LPAD(TRIM(A),L)

(b) To convert X's in S to number pairs write:

LOOP     S BREAK('X') @K SPAN('X') . X @L =     :F(DONE)
         PAIRS = PAIRS '(' N + K ',' SIZE(X) ')'
         N = N + L                              :(LOOP)
DONE
The rest is straightforward.
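
The loop in part (b) can be mirrored in Python as a rough check on the intended output: every maximal run of X's is reported as an (offset, length) pair, where the offset counts the characters that precede the run in the original string. The function name x_runs is illustrative and not taken from the book.

```python
import re


def x_runs(s: str):
    """Return (offset, length) for each maximal run of 'X' in s."""
    return [(m.start(), m.end() - m.start()) for m in re.finditer(r"X+", s)]
```

For the string 'ABXXCDXXXE' this reports a run of two starting after two characters and a run of three starting after six.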
9.9 (a)
         PEEL.K2. = POS(0) TAB(*K1.) (ANY(AFTER) @K2. |
+           LEN(1) FASTBAL(,'"' "'",BEFORE AFTER)
+           (@K2. ANY(BEFORE) | ANY(AFTER) @K2.)
+           | REM @K2.)
(b) Make AFTER, BEFORE and C temporaries to PEEL. Define
PEEL.K2. with unevaluated expressions *AFTER and *BEFORE in
place of AFTER and BEFORE respectively. Change the branch to
PEEL_1 in the first statement of PEEL to PEEL_3; also replace
the branch to ERROR by a branch to PEEL_3. PEEL_3 is defined
as:
PEEL_3   K1. = 0
         ' ,)>' BEFORE LEN(1) . C               :F(ERROR)
         BEFORE = BEFORE C
         ' ,(<' AFTER LEN(1) . C
         AFTER = AFTER C                        :(PEEL_1)
9.11
         NONID = NOTANY('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_.')
L1       X = SNOREAD()                          :F(END)
L2       X (NONID ARBNO('_.')) . N 'ALPHA(' =
+           N 'ALPHANUMERIC('                   :S(L2)
         SNOPUT(X)                              :(L1)
END
====================== Solutions =======================
======================    for    =======================
====================== Chapter 10 ======================


10.1 In the line after BNORM_1 change the go-to field to
:F(FRETURN)S(RETURN) and in the line labeled BNORM_UNB change
the go-to field to :(FRETURN).


10.3 If there is an inversion then the spacing between the
two characters must be < -2. But no string can have a spacing
this negative unless it contained a double BSPACE.


10.5
         NB = NOTANY(BSPACE)
         INORM(S,) (POS(0) | NB) INORM(S,) (NB | RPOS(0))
10.7
         PR_POS = POS(0) @N BREAK(BSPACE) @N FAIL |
+           POS(0) *NE(N,0) TAB(*(N - 1)) . S1
+           (LEN(1) ARBNO(BSPACE LEN(1))) . C1
+           (NOTANY(BSPACE) | RPOS(0)) . C


10.9 (a) Change the line  UF1 = LT(UF1,0) -UF1
                     to  UF1 = LT(UF1,0) (-2 * UF1)


(b) Modify
         UF1 = CW - W
         UF1 = LT(UF1,0) -UF1
         UF1 = UF1 + SIZE(HYPHEN)
to
         UF1 = UF_P * (CW - W)
         UF1 = LT(UF1,0) -(UF_C * (UF1 / UF_P))
         UF1 = UF1 + UF_H * SIZE(HYPHEN)
10.11
           |       (a)        |       (b)
     k     |  value   HYPHEN  |  value   HYPHEN
   --------+------------------+------------------
     2     |    8       -     |    9      null
     4     |    8       -     |    9      null
     6     |    8       -     |    9      null
     8     |  fails  not set  |    9      null
10.13 
Replace DIGRAMS = 'XA,7~(0)B, .«- 
Replace  DIGRAM_TBL = TABLE(30)
by       DIGRAM_PAT = ABORT
Replace  DIGRAM_TBL<C> = ANY(CC)
by       DIGRAM_PAT = C FENCE ANY(CC) | DIGRAM_PAT
In the pattern HYPH_PAT:
Replace  FENCE ARB LEN(1) $ C ...
by       @K ABORT
Replace
         RWORD HYPH_PAT                         :F(FRETURN)
by
         RWORD HYPH_PAT                         :S(HYPH_3)
HYPH_2   K = K + 1 LT(K,SIZE(RWORD) - 1)        :F(FRETURN)
         RWORD TAB(K - 1) DIGRAM_PAT            :F(HYPH_2)
HYPH_3
10.15 (a)
         DEFINE('PRIMAGE(S)I')
         OUTPUT(.OVER, ...)                     :(PRIMAGE_END)
PRIMAGE  OUTPUT = IMAGE(S,1)
         OVER = IMAGE(S,0)
PRIMAGE_1 I = I + 1
         OVER = IMAGE(S,I + 1)                  :S(PRIMAGE_1)F(RETURN)
PRIMAGE_END
(b)      S1 = BNORM(S1) ; S2 = BNORM(S2)
         PRIMAGE(DUPL(' ',9) S1 DUPL(' ',50 - SPACING(S1)) S2)
10.17
         P = BNORM(P)
         LINE_INIT(P)
LOOP     LENGTHS BREAK(',') . CW ',' =          :F(DONE)
         PRIMAGE(DUPL(' ',(60 - CW) / 2) LINE(CW))   :(LOOP)
L        S t#et ('(' BAL . K ')' | LEN(1) . K)
+           = DUPL(' ',SIZE(K)) DUPL(BSPACE,SIZE(K)) K
+                                               :S(L)
         S = BNORM(S)
         OUTPUT = IMAGE(S,2)
         OUTPUT = IMAGE(S,1)
====================== Solutions =======================
======================    for    =======================
====================== Chapter 11 ======================


11.1 a-No, b-No, c-Yes, d-Yes, e-Yes, f-Yes, g-No
11.3 a-1, b-3, c-3, d-2, e-0, f-3, g-4, h-2, i-0
AIs>. :3 I_ = 0! 


11.7 Recursive: F(1) = .164, F(n) = .140n + .006 
Iterative: F(1) = .126, F(n) = .096n + .030 


11.9
         OPSYN('CODE.','CODE')
         DEFINE('CODE(S)')                      :(CODE_END)
CODE              :<CODE.(' CODENO = &STNO + 1 :(CODE_1)')>
CODE_1   CODE = CODE.(S)                        :(RETURN)
CODE_END


11.11 Write a routine CAPTURE(T1,S1) which is called by
TPROFILE upon entry as CAPTURE(TIME(),&LASTNO)


====================== Solutions =======================
======================    for    =======================
====================== Chapter 12 ======================
12.3
         RADIX = 0
         I = 0
         FACTOR = 1
LOOP     V BREAK(',') . V1 LEN(1) =             :F(DONE)


         FACTOR = FACTOR * RADIX
         I = V1 * FACTOR + I                    :(LOOP)


12.5 Add 1 to the number associated with the record 1, 2, 3,
..., N-1 to obtain


         1 + 1*1! + 2*2! + 3*3! + ... + (n-1)*(n-1)!
Note that k! + k*k! = (k+1)! so that the first two terms
keep collapsing until only one term is left, viz. n!
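
The collapsing argument can be checked mechanically. The following Python sketch (the function name is illustrative) accumulates the sum term by term; the loop invariant in the comment is exactly the identity k! + k*k! = (k+1)! used above.

```python
from math import factorial


def collapsed_sum(n: int) -> int:
    """Return 1 + 1*1! + 2*2! + ... + (n-1)*(n-1)!."""
    total = 1  # this is 1!
    for k in range(1, n):
        # Invariant on entry: total == k!, so total + k*k! == (k+1)!
        total += k * factorial(k)
    return total
```

For every n the result equals n!, which is the number of permutations being enumerated.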


12.7 1,0,null string, I 


12.9 PERMUTATION(S, 6 * 5 * 4 * 3 * 2 = 1) 
12.11 (a) The statements which need modification are:

         N = REMDR(I,RADIX)
         I = I / RADIX

(b) Perform 'short division' on the string. The function
below will divide a string by an integer and return the
quotient. R is a global variable set to equal the remainder.

         DEFINE('DIVIDE(S,I)')                  :(DIVIDE_END)
DIVIDE   R =
DIVIDE_1 S LEN(1) . T =                         :F(RETURN)
         R = R T
         DIVIDE = DIVIDE (R / I)
         R = REMDR(R,I)                         :(DIVIDE_1)
DIVIDE_END
So the two statements may be replaced with:
         I = DIVIDE(I,RADIX)
         N = R
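
As a cross-check on DIVIDE, here is a minimal Python sketch of the same short division on a decimal digit string. It returns the remainder alongside the quotient instead of leaving it in a global, and it keeps leading zeros in the quotient as the string-building version does; both choices, and the name divide, are assumptions made for illustration.

```python
from typing import Tuple


def divide(s: str, i: int) -> Tuple[str, int]:
    """Divide the decimal digit string s by i, one digit at a time."""
    quotient = ""
    r = 0
    for ch in s:
        r = r * 10 + int(ch)       # bring down the next digit
        quotient += str(r // i)    # r < 10*i here, so this is a single digit
        r %= i                     # carry the remainder forward
    return quotient, r
```

Because only one digit is handled per step, the intermediate values stay below 10*i however long the string is, which is the point of working on strings rather than integers.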


12.13 After PERM_INIT insert the statement:


         (EQ(SIZE_A,1) DEFINE('PERM(A)','PERM_F'))   :S(RETURN)


12.15
Change:  SIZE_A = +PROTOTYPE(A)
To:      SIZE_A = SIZE(G_S)
Change:  SWAP(.A<AL>,.A<AL + D>)
To:      G_S POS(AL + D - 1) LEN(2) . T = REVERSE(T)


12.17 (a) 100, (b) 20 


12.19 (1) At the entry point, put in an explicit check for
the null string in order to break recursion. (2) Obtain C
from &ALPHABET as follows:

         REVERSE(&ALPHABET) ANY(S) . C
(3) Remove the statement at REORDER_1 and shift the label to
the next statement. (4) Remove the second parameter from the
function definition and from the recursive call.


12.21 All reorderings. The function has no memory so that if
it produced, say, 'ABBC' twice, as it would have to do if it
produced all permutations of 'ABBC', then it would never
produce anything else.


12.23 (a) P, (b) P, (c) I, (d) I.
13.1 The 2 instructions starting with BSORT_2 constitute the
inner loop. An improvement is to add an instruction

         V1 = A<K>
and use V1 in place of A<K> in two places. This saves one
array reference but adds an assignment statement; it is faster
but just barely.


13.3 Replace the two RETURN's by transfers to HSORT_X. Then
replace the two calls to HSORT by the following instructions:
         PUSH(I) ; PUSH(K)
         I = K + 1                              :(HSORT)
HSORT_X  N = POP()                              :F(RETURN)
         I = POP()                              :(HSORT)
13.5
         DEFINE('GRTH(X,Y)')                    :(GRTH_END)
GRTH     GT(X,Y + R)                            :S(RETURN)F(FRETURN)
GRTH_END
         I = MSORT(A,'GRTH')
         A = AI(A,I)
13.7 MSORT(A,'LT')


13.9 Add one more alternand:
         SS_PAT = ... | RPOS(0) . T


13.11 Add the statement LSON(T) = NULL before LIN_1. 


13.13 (a) 2(n+1)(1/2 + 1/3 + ... + 1/(n+1)) - 2n
(b) 2 ln 2 = 1.38


14.1 (a) MAX(X,Y) will fail if X < Y. (b) Append a semicolon
(;) to the argument.


14.3 Change the :(RETURN) to :S(RETURN) and add the following
two statements:

         OUTPUT = CODE
         CODE(LBL ' :(FRETURN)')                :(RETURN)




14.5
         <Definition of LOADEX function>
                                                :(START)
L1       LOADEX('L1')                           :(L1)
L2       LOADEX('L2')                           :(L2)
L100     LOADEX('L100')                         :(L100)
START


14.7 Makes no difference. 


14.9 Replace

         PUSH(&ANCHOR) ... &ANCHOR = 0 ... &ANCHOR = POP() ...
by
         PUSH(ARB) ... ARB = &ARB ... ARB = POP() ...


14.11 The names used by both packages to name identical
operations must not be the same. Thus
REDEFINE('+','CSUM(X,Y)') would be OK for complex sum, but not
REDEFINE('+','SUM(X,Y)').


14.13
         DEFINE('F.(X)')
         OPSYN('F','F.')                        :(F_END)
F.       F. = X                                 :(RETURN)
F_END

14.15 
         REDEFINE(' ','CAT(X,Y)')

CAT CAT = -X¥() X * Y 2S (RETURN) 
CAT = CAT. (X,Y) : (RETURN) 

14.17
         OPSYN('OPSYN.','OPSYN')
         DEFINE('OPSYN(NAME1,NAME2)')
         OPSYN('DEFINE.','DEFINE')
         DEFINE.('DEFINE(PROTO,LBL)NM')
         DEFINE('FUNCTION(NAME)')
         FUNC_LIST = ',OPSYN.,OPSYN,DEFINE,'
                                                :(FUNCTION_END)
DEFINE   PROTO BREAK('(') . NM
         FUNC_LIST = FUNC_LIST NM ','
         DEFINE.(PROTO,LBL)                     :(RETURN)
FUNCTION FUNC_LIST ',' NAME ','                 :S(RETURN)F(FRETURN)
OPSYN    FUNC_LIST = FUNC_LIST NAME1 ','
         OPSYN.(NAME1,NAME2)                    :(RETURN)
FUNCTION_END
15.1
         DEFINE('COMB(N,M)K')                   :(COMB_END)
COMB     COMB = 1
COMB_1   EQ(K,M)                                :S(RETURN)
         K = K + 1
         COMB = COMB * ((N - M) + K) / K        :(COMB_1)
COMB_END
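
COMB appears to use the multiplicative scheme for binomial coefficients, multiplying by (N - M + K) and dividing by K at step K. A Python sketch of that scheme follows (the lowercase names are illustrative). The integer division is exact at every step because the running value after step K is itself the binomial coefficient of N - M + K over K.

```python
def comb(n: int, m: int) -> int:
    """Number of ways to choose m items from n, built up one factor at a time."""
    result = 1
    for k in range(1, m + 1):
        # result is C(n - m + k - 1, k - 1); the product below is divisible by k
        result = result * ((n - m) + k) // k
    return result
```

Multiplying before dividing keeps every intermediate value an integer, so no rounding is involved.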


15.3 COMB(L,N) - 1
15.5 (a) DIFF DIFF = SUM(X,MINUS(Y)) :(RETURN) (b) 5


15.7 Before the first of the SPLITs insert 
DIV = LE (SUBSTR(Y,1,1), 5) Xx *2S/Y* 2 


15.9 X > YS  (CEIL(Z) + 1) 


15.11 (a) E = e2 / 2(e + 1) (b) 5 


15.13 A= 1, 2, 4, 5 (integers). 


15.15 (a)
         ASIN(X) = 2 * ASIN(SQRT((1 - SQRT(1 - X ** 2)) / 2))
(b) the same as the stopping criterion for SIN(A)
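
The identity in (a) is the half-angle formula sin(t/2) = sqrt((1 - cos t) / 2) applied with sin t = X, and it can be checked numerically. The Python sketch below assumes 0 <= X <= 1 and uses the library asin only to confirm the identity; the function names are illustrative.

```python
from math import asin, sqrt


def reduced_argument(x: float) -> float:
    """Return sin(t/2) given x = sin(t), for 0 <= x <= 1."""
    return sqrt((1.0 - sqrt(1.0 - x * x)) / 2.0)


def asin_by_halving(x: float) -> float:
    """Evaluate asin(x) through the smaller argument sin(t/2)."""
    return 2.0 * asin(reduced_argument(x))
```

The reduced argument never exceeds sin(pi/4), about 0.707, which is what makes the recursion useful for a series that converges slowly near 1.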


15.17 105
15.19
         N = CONVERT(LOG(X,2),'INTEGER') + 1
         X = X / (2 ** N)
         I = CONVERT(X * 2 ** 27,'INTEGER')


15.21 The difficulty is that NAT_BASE is single precision.
Replace the second occurrence of NAT_BASE by EXP(X / X).


16.1 R() = ID(RANDOM(0))


16.3 Let HA = LEN(5). Then the following statement will
execute the deal.
         RPERMUTE(DECK) HA . P1 HA . P2 HA . P3 HA . P4
16.5 The last one. Instead of assigning CODE(CODE) to a
table, simply go to it. The first two statements could also 
be eliminated. 


16.7 In general, any string not containing a balancing right 
bracket to a left bracket will cause looping. One example is 
'('. The cure is to prefix the pattern LEN(1) to LITERAL.TEXT. 


16.9 Let %C be equivalent to C where C is some character.
Thus %] is equivalent to ] and %% is equivalent to %.
Implementation is simple:

         LITERAL.TEXT = POS(0) '%' LEN(1) . TEXT |
+           BREAK('<=(%') . TEXT


16.11 The probability P must satisfy the equation: 2P = 1 +
P**3. The solutions to this equation are 1, .618, and -1.62.
The value 1 is unsuitable because the situation is clearly
worse than the case where it just barely halts. -1.62 is not
a probability. Hence, by elimination, P = .618
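
The three solutions follow from the factorization P**3 - 2P + 1 = (P - 1)(P**2 + P - 1), whose quadratic factor has roots (-1 +/- sqrt(5)) / 2. A short Python check (the function name is illustrative) confirms that each value satisfies 2P = 1 + P**3 and that the middle root is about 0.618:

```python
from math import sqrt


def halting_equation_roots():
    """Roots of 2P = 1 + P**3, from (P - 1)(P**2 + P - 1) = 0."""
    return (1.0, (-1.0 + sqrt(5.0)) / 2.0, (-1.0 - sqrt(5.0)) / 2.0)
```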


16.13 (a)
LOOP     N = N + 1
         NUM = LT(RANDOM(),RANDOM() ** 2) NUM + 1.0
         OUTPUT = EQ(REMDR(N,100),0) N ' ' (NUM / N)   :(LOOP)
(b) ± .94/SQRT(N)
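
A Python rendering of the same experiment may help in checking both parts. It estimates the probability that one uniform random number is less than the square of another, which is 1/3 exactly; two standard deviations of the estimate come to 2 * sqrt((1/3)(2/3) / N), or about .94 / sqrt(N). The seeded generator and the function name are assumptions made so that runs are repeatable.

```python
import random


def estimate(n: int, seed: int = 0) -> float:
    """Fraction of n trials in which a uniform X falls below Y**2."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        if rng.random() < rng.random() ** 2:
            hits += 1
    return hits / n
```

With n = 20000 the two-sigma band is roughly 0.0067 wide on either side of 1/3.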


16.15 Replace the rule that begins 'OUTS = GT(' by simply the
predicate to obtain the statement:
         GT(K,H(S))                             :S(RS_OUT)
Then at RS_OUT insert:
RS_OUT   ADV = LT(RANDOM(),E) '123R'            :S(RS_4)
         OUTS = OUTS + 1


16.17 In the program which follows, FORMAT will format a 
string for output; MIRIM will return the mirror image of any 
given sequence of positions and RSTEP will move half the 
dancers one random step forward making sure no conflicts occur 
among the dancers or their mirror images. 


         DEFINE('FORMAT(S)C')                   :(FORMAT_END)
FORMAT   S LEN(1) . C =                         :F(RETURN)
         FORMAT = FORMAT ' ' C                  :(FORMAT)
FORMAT_END
         DEFINE('MIRIM(POS)')                   :(MIRIM_END)
MIRIM    MIRIM = REPLACE(POS,'ABCDEFGHIJKLMXYZ',
+           'DCBALHYFEMLKJZGX')                 :(RETURN)
MIRIM_END


         DEFINE('RSTEP(CPOS)P,NPS,NP')
         NEXT_POS = 'A(ABPEF) B(ABCF) E(AEFJX) F(ABEFJK) J(EJFKX) '
+           'K(SFKXGL) XK(EJXR) Y(KYL)'
         NEXT_POS = NEXT_POS MIRIM(NEXT_POS)    :(RSTEP_END)
RSTEP    CPOS LEN(1) . P =                      :F(RETURN)
         NEXT_POS P '(' ARB . NPS ')'
         NPS = RPERMUTE(NPS)
RSTEP_1  NPS LEN(1) . NP =                      :F(FRETURN)
         'XZ' NP                                :S(RSTEP_2)
         (RSTEP MIRIM(RSTEP)) NP                :S(RSTEP_1)
RSTEP_2  RSTEP = RSTEP NP                       :(RSTEP)
RSTEP_END

         OUTPUT = FORMAT('12345678')
         POS = 'XXXX'
LOOP     OUTPUT = FORMAT(POS MIRIM(POS))
         POS = RSTEP(POS)
         N = LT(N,100) N + 1                    :S(LOOP)
END
17.1 Assume for the moment that ONEWAY maps integers to
integers. The machine obtains a random number N1 and prints
ONEWAY(N1). The player thinks of a number N2 and types it in.
The machine initializes a random number generator with the sum
N1 + N2. After the hand is completely over and before the
start of a new deal, the machine prints out N1 which enables
the player to check on the machine's honesty.
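
The protocol can be sketched in Python as follows. SHA-256 stands in for ONEWAY here purely as an assumption for illustration (the book's ONEWAY maps integers to integers), and every function name is hypothetical. The essential points are that the machine commits to N1 before seeing N2, and that the player can later recompute both the commitment and the deal.

```python
import hashlib
import random


def oneway(n: int) -> str:
    """Stand-in one-way function: easy to compute, hard to invert."""
    return hashlib.sha256(str(n).encode()).hexdigest()


def machine_commit(rng: random.Random):
    """Machine picks N1 and publishes only ONEWAY(N1)."""
    n1 = rng.randrange(2 ** 64)
    return n1, oneway(n1)


def deal(n1: int, n2: int):
    """Shuffle a 52-card deck with a generator seeded by N1 + N2."""
    deck = list(range(52))
    random.Random(n1 + n2).shuffle(deck)
    return deck


def player_verifies(commitment: str, revealed_n1: int, n2: int, observed_deck) -> bool:
    """After the hand: check the revealed N1 against the commitment and the deal."""
    return oneway(revealed_n1) == commitment and deal(revealed_n1, n2) == observed_deck
```

Neither side controls the seed alone: the machine cannot change N1 after committing, and the player cannot choose N2 to force a deal without knowing N1.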


17.3 The game is ill-formed. From a decision graph standpoint
there are an infinitude of nodes and every terminal
state is avoided by A whose best interests lie in prolonging
the game until B's wallet is exhausted.


17.5 Variables which can't be used are those indicated as
temporary. They all begin with 'Q' so that programs using
QUEST should avoid them. As a precaution to their forgetting,
one can insert

         QN POS(0) 'Q'                          :S(ERROR)
after label QUESTP_1.


17.7 After the check for '...' insert:
         QVP POS(0) LEN(1) . QC1 '-' LEN(1) . QC2 RPOS(0)
+                                               :F(QUESTP_4)
         &ALPHABET BREAK(QC1) BREAK(QS)         :F(FRETURN)
         REVERSE(&ALPHABET) BREAK(QC2) BREAK(QS)   :F(FRETURN)
         EQ(SIZE(QS),1)                         :S(QUESTP_3)F(FRETURN)
QUESTP_4


17.9 Replace J = 0 by LIST = MAX. Replace:
         J = J + 1 LT(J,MAX)
by
         LIST BREAK(',') . J ',' | (LEN(1) REM) . J =

As a matter of aesthetics, the name 'MAX' could be changed.


17.11 For both cases, 8 X 3 X 2 = 48 


17.13 3 X 2 = 6
17.15 Add: EQ(V,1) :S(TTTM_4) immediately before TTTM_3.


17.17 Replace &ALPHABET by ORD_ALPHA which is defined as:

         FULL_DECK LEN(13) . SA LEN(13) . SB
+           LEN(13) . SC LEN(13) . SD
         ORD_ALPHA = BLEND(BLEND(SA,SB),BLEND(SC,SD))
17.19
LOOP     H = VALS(RHAND(13,1))
         OUTPUT = 4 * COUNT(H,'M') + 3 * COUNT(H,'L')
+           + 2 * COUNT(H,'K') + COUNT(H,'J')   :(LOOP)
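
The arithmetic of the OUTPUT statement is a weighted count of four card values. A Python sketch of the same computation is given below; it takes the letters M, L, K and J exactly as they appear in the statement, and the reading of them as the four highest ranks is an assumption, as is the function name.

```python
def high_card_points(values: str) -> int:
    """Weighted count 4-3-2-1 over the value letters M, L, K and J."""
    return (4 * values.count("M") + 3 * values.count("L")
            + 2 * values.count("K") + values.count("J"))
```

Applied to the string of thirteen value letters for a hand, it yields the figure the loop prints for that hand.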
17.21 The problem lies with the FLUSH test. It should
properly go after the test for a full house. Thus 2H 2H 2H 3H
3H should be interpreted as a full house. The initial pairs
test was inserted for speed. This could be left out,
simplifying the result.


17.23 Setting VALS = WV and doing a :(PR(2)) is good enough
for a uniform distribution but won't distinguish between hands
that contain the same pairs but differ in only the fifth card.
Hence, replace the WV in the call to PR by the expression:
         BASEB(CONVERT((CONVERT(DECOMB(W V),'REAL') / COMB(13,2))
+           * 13 ** 2,'INTEGER'),13)


17.25 After HE_BETS insert:
         QUEST('How much? /BET(1...BET)')
18.1 One method is to insert integers rather than strings
into the table. Thus, instead of inserting '2F', insert
BASE10('2F',16). Another, perhaps extreme, method is to
combine all elements of a table into a long string and use
pattern matching to extract an element.


18.3 PUNCH = CH(OP AC X A) (Using Prog. 2.7). 


18.5 The single quote (') cannot be used. The solution is to
use the QUOTE function (Prog. 3.16).


18.7 Assuming CRNAME() returns a unique created name:
         IDTBL = TABLE()
         IDEN = ... S('ID')

         ...

S_ID     T = POP()
         (DIFFER(IDTBL<T>) PUSH(IDTBL<T>))      :S(NRETURN)
         IDTBL<T> = CRNAME()
         PUSH(IDTBL<T>)                         :(NRETURN)
18.9
ce) = ce te 
L2 X = INPUT 2F<CODE(S ' 3 +: (DONE) *)> 
X '<* BREAK(*>*) . K #>* = K 
x teest = t = #t 
L X '<* BREAK(*>*) 2 K fo! = Qeet Kt eg :S(L) 
X = REPLACE(X,"{',*<") 
L1 X "t = Q ty? Q :S(L1) 
S = S X "38 3: (L2) 
DONE 


18.11 ALPHA( H) would be converted to ALPHA(**). 
18.13
         NLSTMT = ... *PUSH() BL
         STMT = IFSTMT | ASGNSTMT | NLSTMT
18.15 Writing DNF is obvious. We then replace *E of ASGNSTMT by
         (*E | *DNF('assignment operator (=)','EXPRESSION'))
Replace the BOOL of IFSTMT by
         (BOOL | *DNF('IF keyword','relation'))
etc.


18.17
         ATREE = BREAK(':,') (',' | ':' SPAN('0123456789') $ N
+           ',' *EVAL(DUPL('*ATREE ',N)))


18.19
HERE     POLISH 'COMMA:' SPAN('0123456789') . N ','
+           ARB_TREE . T 'COMMA:2' = 'COMMA:' (N + 1) T   :S(HERE)


18.21 At TR_REF, after extracting the ID, apply the predicate 
ATEST(ID). If this fails, branch to TR_FREF defined as 
follows. 


TR_FREF POLISH POS(0) ‘*COMMA:2,* = :F (TR_FREF1) 
TR TR() ‘ARG,* FOP() ‘'//* : (TR_FREF) 
TR TR() ‘'ARG,* FCP() ‘ss? 


tr | 
wa 
t 


TR_FREF1 TR 


TR TR "CALL," ID ',,* PUSH(TEMP()) ‘*//! 
: (RETURN) 

18.23
TU_ADD ;TU_MUL ;TU_FADD ;TU_FMUL
         ISREG(ARG1)                            :S(TU_SUB)
         R = ISREG(ARG2)                        :F(TU_SUB)
         OUTPUT = ' ' OP ' ' R ',' ADDR(ARG1)
         DEASSOC(R)
         STORE(R,ARG3)                          :(RETURN)
TU_SUB ;TU_DIV ;TU_FSUB ;TU_FDIV
18.25 (a) 3, (b) 6, (c) Returns the successor of a number. 


18.27 #DEF,CS,<#DEF,S,#S;%1>;3 


APPENDIX on | 
( 1 
I i 


Cross-reference Listing of Functions 


Fru enn enc | 


i Is { 
{ Program Number References referenced { 
{ by I 
[StS tS Sr Se esee eae er aos ee Se sek as SSS Se eee Sere 1 
t | 
AGT 3.13 UPLO t 
{ AI 4.6 SEQ FRSORT t 
{ AOPA 4.4 SEQ | 
| { 
{ ARC 15.8 SQRT { 
( DEXP { 
t { 
{ ASM 18.1 BASEB { 
| RPAD | 
i ( 
| ASM360 8.11 { 
] I 
i BAL 8.3 PEEL | 
a . RS ENTENCE { 
| GPM { 
i ( 
j BALREV 3.8 REVERSE OR t 
{ _ . HYPHENATE | 
| \ 
{ BASEB 2.4 ONEWAY H 
i ASM { 
| { 
{ . BASE10 2.5 CH | 
t : ONEWAY { 
{ POKEV { 
{ 1 
( BCD_EBCDIC 2e2 | 
{ _ BLANKS 18.3 DIFF { 
! 1 
i BLEND 3.7 _ HEX { 
t LEXGT 1 
i INORM { 
t ONEWAY | 
1 { 



BRKREM 
BSORT 


CARDPAK 
CATA 
CEIL 
CH 


COMB 


DEXTERN 
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DEXP 
BASE10 


BREAKX 


COUNT 


COMB 




POKEV 
POKER 


DECOMB 
POKEV 


CRACK 
SPACING 
MINP 
FRSORT 


FRSORT 
POKEV 


CEIL 
TRIG 
ARC 
LOG 
RAISE 
PHRASE 
POL 
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| ; : Is { 
| Program Number References _ referenced { 
t by I 
a a aa aan acacia aa { 
1 { 
t. DIFF 3.10 SKIM { 
f LEXGT { 
{ BRKREM ! 
{ INORM { 
{ HY PHENATE { 
t BLANKS \ 
1 s \ 
| FASTBAL 8.4 SNOREAD { 
I FIND 4.5 i 
1 FLD 5.9 VISIT { 
{ FORPPUT) 9.8 PUT i 
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I. FORTREAD 9.2 READ { 
i FPROFILE 11.6 LPROG t 
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t AI I 
i MSORT \ 
{ STRINGOUT i 
1 CRACK ( 
| = | 
{ .  FTRACE 14.3 { 
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{ GFM 18.8 BAL { 
{ PUSH { 
| POP ( 
t . 1 
{ HEX 2.6 BLEND { 
I. HSORT 13.2 SWAP \ 
1 i 
i HYPHENATE 10.7 BALREV _ LINE { 
1. UPLO \ 
{ DIFF { 
{ . IMAGE 10.8 SPACING | 
l BREAKX | 
t a 1 
1 INFINIP 15.3 REDEFINE { 
{ . SWAP { 
{ LPAD \ 
1 { 
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t INSERTB 13.10 { 
l { 
| INSULATE 14.4 PUSH { 
{ POP l 
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| Ip 12.6 MSORT { 
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t L_ONE 18.2 PUSH TR { 
r POP { 
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4 
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